A new fast-food chain has expanded rapidly over the past couple of years and now wants to optimize its supply chain to avoid ingredient shortages. The data science team has been tasked with building a model that predicts the output of each food-processing farm over the next few years. These predictions can then feed into the chain's existing supply chain management systems to improve their efficiency.
!pip install pystan~=2.14
!pip install fbprophet
Successfully installed pystan-2.19.1.1 (replacing the preinstalled pystan 3.3.0)
Building wheel for fbprophet (setup.py) ... done
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
prophet 1.1.2 requires cmdstanpy>=1.0.4, but you have cmdstanpy 0.9.5 which is incompatible.
Successfully installed cmdstanpy-0.9.5 fbprophet-0.7.1 setuptools-git-1.2
(The conflict concerns the preinstalled prophet 1.1.2, which this notebook does not import; fbprophet 0.7.1 pins cmdstanpy==0.9.5.)
import random
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf, month_plot, quarter_plot
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller
import statsmodels.api as sm
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.arima_model import ARIMA, ARMA  # deprecated; statsmodels.tsa.arima.model.ARIMA is the replacement
from sklearn.metrics import mean_squared_error as mse
from fbprophet import Prophet

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import *
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.losses import MeanSquaredError
from tensorflow.keras.metrics import RootMeanSquaredError
from tensorflow.keras.optimizers import Adam

from google.colab import drive
drive.mount('/content/drive')
## Function to read a CSV and print its shape, dtypes, and null counts
def read_store(location):
    '''
    location: path to the CSV file (str)
    '''
    data = pd.read_csv(location)
    display(data.head())
    print("--------------------------------------------------------------------------")
    print('\033[94m')
    print("Shape of data is:", data.shape)
    print("---------------------------------------------------------------------------\n")
    print("Data types:")
    display(data.dtypes)
    print("---------------------------------------------------------------------------\n")
    print('\033[94m')
    print("Null values info:")
    display(data.isnull().sum())
    return data
train_loc='/content/drive/MyDrive/Final_capstone/train_data/train_data.csv'
train_data1=read_store(train_loc)
train_data=train_data1.copy()
# The training data is very large (~20 million rows)
train_wealoc='/content/drive/MyDrive/Final_capstone/train_data/train_weather-1646897968670.csv'
train_weather1=read_store(train_wealoc)
train_weather=train_weather1.copy()
farm_loc='/content/drive/MyDrive/Final_capstone/train_data/farm_data-1646897931981.csv'
train_farm=read_store(farm_loc)
farm_data=train_farm.copy()
total_loc='/content/drive/MyDrive/Final_capstone/train_data/total_data.csv'
total_data1=read_store(total_loc)
total_data=total_data1.copy()
|   | date | farm_id | ingredient_type | yield | deidentified_location | temp_obs | wind_direction | dew_temp | pressure_sea_level | precipitation | wind_speed | farm_area | farming_company |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-01-01 00:00:00 | fid_110884 | ing_w | 0.0 | location 7369 | 25.0 | 0.0 | 20.0 | 1019.7 | 0.0 | 0.0 | 690.455096 | Obery Farms |
| 1 | 2016-01-01 00:00:00 | fid_90053 | ing_w | 0.0 | location 7369 | 25.0 | 0.0 | 20.0 | 1019.7 | 0.0 | 0.0 | 252.696160 | Obery Farms |
| 2 | 2016-01-01 00:00:00 | fid_17537 | ing_w | 0.0 | location 7369 | 25.0 | 0.0 | 20.0 | 1019.7 | 0.0 | 0.0 | 499.446528 | Obery Farms |
| 3 | 2016-01-01 00:00:00 | fid_110392 | ing_w | 0.0 | location 7369 | 25.0 | 0.0 | 20.0 | 1019.7 | 0.0 | 0.0 | 2200.407555 | Obery Farms |
| 4 | 2016-01-01 00:00:00 | fid_62402 | ing_w | 0.0 | location 7369 | 25.0 | 0.0 | 20.0 | 1019.7 | 0.0 | 0.0 | 10833.140121 | Obery Farms |
--------------------------------------------------------------------------
Shape of data is: (20511532, 13)
---------------------------------------------------------------------------
Data types:
date                      object
farm_id                   object
ingredient_type           object
yield                    float64
deidentified_location     object
temp_obs                 float64
wind_direction           float64
dew_temp                 float64
pressure_sea_level       float64
precipitation            float64
wind_speed               float64
farm_area                float64
farming_company           object
dtype: object
---------------------------------------------------------------------------
Null vals info :
date                     0
farm_id                  0
ingredient_type          0
yield                    0
deidentified_location    0
temp_obs                 0
wind_direction           0
dew_temp                 0
pressure_sea_level       0
precipitation            0
wind_speed               0
farm_area                0
farming_company          0
dtype: int64
#ingw
ingw=len(train_data[train_data['ingredient_type']=='ing_w']['farm_id'].value_counts().to_frame().reset_index())
#ingx
ingx=len(train_data[train_data['ingredient_type']=='ing_x']['farm_id'].value_counts().to_frame().reset_index())
#ingy
ingy=len(train_data[train_data['ingredient_type']=='ing_y']['farm_id'].value_counts().to_frame().reset_index())
#ingz
ingz=len(train_data[train_data['ingredient_type']=='ing_z']['farm_id'].value_counts().to_frame().reset_index())
x=['ing_w','ing_x','ing_y','ing_z']
y=[ingw,ingx,ingy,ingz]
ax = sns.barplot(x=x,y=y)
for p in ax.patches:
    ax.annotate("%d" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', fontsize=11, color='gray', xytext=(0, 10),textcoords='offset points')
ax.set_xlabel("Ingredient_types")
ax.set_ylabel("No.of.processing_plants")
ax.set_title("No.of.processing_plants vs Ingredient_types")
plt.ylim(0, 1600)
plt.show()
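The four repeated `value_counts().to_frame().reset_index()` calls above can be collapsed into a single `groupby`/`nunique`, which counts the distinct farms per ingredient in one pass. A minimal sketch on toy data (column names match the notebook; the values are made up):

```python
import pandas as pd

# Toy stand-in for train_data; real values differ.
df = pd.DataFrame({
    'ingredient_type': ['ing_w', 'ing_w', 'ing_x', 'ing_y', 'ing_y', 'ing_y'],
    'farm_id':         ['fid_1', 'fid_2', 'fid_1', 'fid_1', 'fid_1', 'fid_3'],
})

# Number of distinct farms producing each ingredient, in one pass.
farms_per_ing = df.groupby('ingredient_type')['farm_id'].nunique()
print(farms_per_ing)
# ing_w -> 2, ing_x -> 1, ing_y -> 2
```

The resulting Series can be passed straight to `sns.barplot(x=farms_per_ing.index, y=farms_per_ing.values)`.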
per_ing=train_data[['ingredient_type','yield']].groupby(by='ingredient_type').sum()
per_ing['percentage']=per_ing['yield']/(per_ing['yield'].sum())
per_ing
|   | yield | percentage |
|---|---|---|
| ingredient_type | ||
| ing_w | 2.060313e+09 | 0.048138 |
| ing_x | 2.649007e+09 | 0.061893 |
| ing_y | 3.760286e+10 | 0.878573 |
| ing_z | 4.877499e+08 | 0.011396 |
per_ing.reset_index(inplace=True)
fig, ax = plt.subplots(figsize=(8, 6))
ax.pie(per_ing['percentage'], labels=per_ing['ingredient_type'], autopct='%1.1f%%', startangle=90)
ax.set_title('% of yield occupied by various ingredients')
# show plot
plt.show()
total_data.head()
|   | date | farm_id | ingredient_type | yield | deidentified_location | temp_obs | wind_direction | dew_temp | pressure_sea_level | precipitation | wind_speed | farm_area | farming_company |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-01-01 00:00:00 | fid_110884 | ing_w | 0.0 | location 7369 | 25.0 | 0.0 | 20.0 | 1019.7 | 0.0 | 0.0 | 690.455096 | Obery Farms |
| 1 | 2016-01-01 00:00:00 | fid_90053 | ing_w | 0.0 | location 7369 | 25.0 | 0.0 | 20.0 | 1019.7 | 0.0 | 0.0 | 252.696160 | Obery Farms |
| 2 | 2016-01-01 00:00:00 | fid_17537 | ing_w | 0.0 | location 7369 | 25.0 | 0.0 | 20.0 | 1019.7 | 0.0 | 0.0 | 499.446528 | Obery Farms |
| 3 | 2016-01-01 00:00:00 | fid_110392 | ing_w | 0.0 | location 7369 | 25.0 | 0.0 | 20.0 | 1019.7 | 0.0 | 0.0 | 2200.407555 | Obery Farms |
| 4 | 2016-01-01 00:00:00 | fid_62402 | ing_w | 0.0 | location 7369 | 25.0 | 0.0 | 20.0 | 1019.7 | 0.0 | 0.0 | 10833.140121 | Obery Farms |
eda=total_data[['farming_company','yield']].groupby(by='farming_company',as_index=False)['yield'].sum()
eda['yield%']=(eda['yield']/eda['yield'].sum())*100
eda.sort_values(by='yield%',inplace=True)
eda=eda.tail(4)
eda
|   | farming_company | yield | yield% |
|---|---|---|---|
| 0 | Del Monte Foods | 6.075534e+08 | 1.415153 |
| 11 | Sanderson Farms | 1.076545e+09 | 2.507558 |
| 14 | Wayne Farms | 2.318198e+09 | 5.399698 |
| 8 | Obery Farms | 3.743627e+10 | 87.198996 |
100-eda['yield%'].sum()
3.478595106002615
new={'farming_company':'Other_companies','yield':0,'yield%':100-eda['yield%'].sum()}
eda=eda.append(new,ignore_index=True)
eda['yield%'].sum()
100.0
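`DataFrame.append` works on the pandas 1.3.5 pinned here, but it was removed in pandas 2.0. The same "add an Other row" step can be written with `pd.concat`, computing the remainder instead of hard-coding it. A sketch with toy numbers (not the notebook's):

```python
import pandas as pd

# Toy yield-share table; real values differ.
eda = pd.DataFrame({'farming_company': ['A', 'B'],
                    'yield': [600.0, 300.0],
                    'yield%': [60.0, 30.0]})

# Remainder not covered by the listed companies, computed rather than hard-coded.
other = pd.DataFrame([{'farming_company': 'Other_companies',
                       'yield': 0.0,
                       'yield%': 100 - eda['yield%'].sum()}])
eda = pd.concat([eda, other], ignore_index=True)
print(eda['yield%'].sum())  # 100.0
```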
fig, ax = plt.subplots(figsize=(8, 6))
ax.pie(eda['yield%'], labels=eda['farming_company'], autopct='%1.1f%%', startangle=8)
ax.set_title('% yield share of various farming companies')
# show plot
plt.show()
eda=total_data[['deidentified_location','yield']].groupby(by='deidentified_location',as_index=False)['yield'].sum()
eda['yield%']=(eda['yield']/eda['yield'].sum())*100
eda.sort_values(by='yield%',inplace=True)
eda=eda.tail(4)
100-(eda['yield%'].sum())
6.971137857751302
new={'deidentified_location':'Other_locations','yield':0,'yield%':100-eda['yield%'].sum()}
eda = eda.append(new, ignore_index = True)
fig, ax = plt.subplots(figsize=(8, 6))
ax.pie(eda['yield%'], labels=eda['deidentified_location'], autopct='%1.1f%%', startangle=4)
ax.set_title('% yield coming from various locations')
# show plot
plt.show()
train_data.head()
|   | date | farm_id | ingredient_type | yield |
|---|---|---|---|---|
| 0 | 2016-01-01 00:00:00 | fid_110884 | ing_w | 0.0 |
| 1 | 2016-01-01 00:00:00 | fid_90053 | ing_w | 0.0 |
| 2 | 2016-01-01 00:00:00 | fid_17537 | ing_w | 0.0 |
| 3 | 2016-01-01 00:00:00 | fid_110392 | ing_w | 0.0 |
| 4 | 2016-01-01 00:00:00 | fid_62402 | ing_w | 0.0 |
train_data.dtypes
date                object
farm_id             object
ingredient_type     object
yield              float64
dtype: object
train_data.nunique()
date                  8784
farm_id               1434
ingredient_type          4
yield              1688175
dtype: int64
#Converting date into datetime type
train_data['date']=pd.to_datetime(train_data['date'],format='%Y-%m-%d %H:%M:%S')
train_data.dtypes
date               datetime64[ns]
farm_id                    object
ingredient_type            object
yield                     float64
dtype: object
train_data.head()
|   | date | farm_id | ingredient_type | yield |
|---|---|---|---|---|
| 0 | 2016-01-01 | fid_110884 | ing_w | 0.0 |
| 1 | 2016-01-01 | fid_90053 | ing_w | 0.0 |
| 2 | 2016-01-01 | fid_17537 | ing_w | 0.0 |
| 3 | 2016-01-01 | fid_110392 | ing_w | 0.0 |
| 4 | 2016-01-01 | fid_62402 | ing_w | 0.0 |
def minmaxdate(data,col_name):
print('Min date in data:--',data[col_name].min())
print('Max date in data:--',data[col_name].max())
minmaxdate(train_data,col_name='date')
Min date in data:-- 2016-01-01 00:00:00
Max date in data:-- 2016-12-31 23:00:00
train_data.describe(include='all')
<ipython-input-31-90d4f858b61d>:1: FutureWarning: Treating datetime data as categorical rather than numeric in `.describe` is deprecated and will be removed in a future version of pandas. Specify `datetime_is_numeric=True` to silence this warning and adopt the future behavior now.
|   | date | farm_id | ingredient_type | yield |
|---|---|---|---|---|
| count | 20216100 | 20216100 | 20216100 | 2.021610e+07 |
| unique | 8784 | 1434 | 4 | NaN |
| top | 2016-12-27 22:00:00 | fid_63700 | ing_w | NaN |
| freq | 2370 | 51227 | 12060910 | NaN |
| first | 2016-01-01 00:00:00 | NaN | NaN | NaN |
| last | 2016-12-31 23:00:00 | NaN | NaN | NaN |
| mean | NaN | NaN | NaN | 2.117121e+03 |
| std | NaN | NaN | NaN | 1.532356e+05 |
| min | NaN | NaN | NaN | 0.000000e+00 |
| 25% | NaN | NaN | NaN | 1.830000e+01 |
| 50% | NaN | NaN | NaN | 7.877500e+01 |
| 75% | NaN | NaN | NaN | 2.679840e+02 |
| max | NaN | NaN | NaN | 2.190470e+07 |
##EDA
# 1. Identify the farms with the highest and lowest crop yield
train_data.head()
|   | date | farm_id | ingredient_type | yield |
|---|---|---|---|---|
| 0 | 2016-01-01 | fid_110884 | ing_w | 0.0 |
| 1 | 2016-01-01 | fid_90053 | ing_w | 0.0 |
| 2 | 2016-01-01 | fid_17537 | ing_w | 0.0 |
| 3 | 2016-01-01 | fid_110392 | ing_w | 0.0 |
| 4 | 2016-01-01 | fid_62402 | ing_w | 0.0 |
eda1=train_data.groupby(by='farm_id',as_index=False).sum()
eda1.head()
|   | farm_id | yield |
|---|---|---|
| 0 | fid_100053 | 4.722934e+05 |
| 1 | fid_100107 | 5.786395e+07 |
| 2 | fid_100135 | 2.196656e+05 |
| 3 | fid_100249 | 1.574181e+05 |
| 4 | fid_100303 | 5.318755e+05 |
eda1.sort_values(by='yield',ascending=False,inplace=True,ignore_index=True)
eda1.head()
|   | farm_id | yield |
|---|---|---|
| 0 | fid_72059 | 3.350620e+10 |
| 1 | fid_121183 | 4.404114e+08 |
| 2 | fid_118955 | 4.286342e+08 |
| 3 | fid_49646 | 3.904002e+08 |
| 4 | fid_37988 | 3.142415e+08 |
top5=eda1.head(5)
low5=eda1.tail(5)
fig, axes = plt.subplots(2,2, figsize=(10, 8))
fig.suptitle('Top and Low yielded farms')
ax1=sns.barplot(ax=axes[0, 0],data=top5,x='farm_id',y='yield')
ax1.set_xlabel("Farm_id")
ax1.set_ylabel("Yield(*e10)")
ax1.set_title('Top 5 farms by yield')
ax1=sns.barplot(ax=axes[0, 1],data=low5,x='farm_id',y='yield')
ax1.set_xlabel("Farm_id")
ax1.set_ylabel("Yield")
ax1.set_title('Lowest 5 farms by yield')
axes[1,0].set_axis_off()
axes[1,1].set_axis_off()
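The sort-then-`head`/`tail` selection above can also be written with `nlargest`/`nsmallest`, which avoids mutating the frame with an in-place sort. A sketch on toy data (values are made up):

```python
import pandas as pd

# Toy per-farm yield totals; real values differ.
eda1 = pd.DataFrame({'farm_id': [f'fid_{i}' for i in range(6)],
                     'yield':   [10.0, 500.0, 3.0, 42.0, 7.0, 900.0]})

top2 = eda1.nlargest(2, 'yield')   # highest-yield farms
low2 = eda1.nsmallest(2, 'yield')  # lowest-yield farms
print(list(top2['farm_id']))  # ['fid_5', 'fid_1']
print(list(low2['farm_id']))  # ['fid_2', 'fid_4']
```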
train_data.columns
Index(['date', 'farm_id', 'ingredient_type', 'yield'], dtype='object')
train_data.nunique()
date                  8784
farm_id               1434
ingredient_type          4
yield              1688175
dtype: int64
fid_72059=train_data[train_data['farm_id']=='fid_72059']
fid_72059.head()
|   | date | farm_id | ingredient_type | yield |
|---|---|---|---|---|
| 1533 | 2016-01-01 00:00:00 | fid_72059 | ing_w | 282.248 |
| 1534 | 2016-01-01 00:00:00 | fid_72059 | ing_y | 3241630.000 |
| 3829 | 2016-01-01 01:00:00 | fid_72059 | ing_w | 282.248 |
| 3830 | 2016-01-01 01:00:00 | fid_72059 | ing_y | 3241630.000 |
| 6125 | 2016-01-01 02:00:00 | fid_72059 | ing_w | 284.864 |
fid_72059.nunique()
date               8784
farm_id               1
ingredient_type       2
yield             12653
dtype: int64
fid_72059.set_index('date',inplace=True)
fid_72059
|   | farm_id | ingredient_type | yield |
|---|---|---|---|
| date | |||
| 2016-01-01 00:00:00 | fid_72059 | ing_w | 282.248 |
| 2016-01-01 00:00:00 | fid_72059 | ing_y | 3241630.000 |
| 2016-01-01 01:00:00 | fid_72059 | ing_w | 282.248 |
| 2016-01-01 01:00:00 | fid_72059 | ing_y | 3241630.000 |
| 2016-01-01 02:00:00 | fid_72059 | ing_w | 284.864 |
| ... | ... | ... | ... |
| 2016-12-31 21:00:00 | fid_72059 | ing_y | 4800.000 |
| 2016-12-31 22:00:00 | fid_72059 | ing_w | 266.022 |
| 2016-12-31 22:00:00 | fid_72059 | ing_y | 4896.000 |
| 2016-12-31 23:00:00 | fid_72059 | ing_w | 263.689 |
| 2016-12-31 23:00:00 | fid_72059 | ing_y | 5040.000 |
17566 rows × 3 columns
fig, axes = plt.subplots(2,2, figsize=(10, 8))
fig.suptitle('Analysis of ingredients for Top yielded farm')
ax1=fid_72059[fid_72059['ingredient_type']=='ing_w'].plot(ax=axes[0,0])
ax1.set_xlabel("Date")
ax1.set_ylabel("Yield")
ax1.set_title('ing_w')
ax1=fid_72059[fid_72059['ingredient_type']=='ing_y'].plot(ax=axes[0,1])
ax1.set_xlabel("Date")
ax1.set_ylabel("Yield(*e7)")
ax1.set_title('ing_y')
axes[1,0].set_axis_off()
axes[1,1].set_axis_off()
fid_111424=train_data[train_data['farm_id']=='fid_111424']
fid_111424
|   | date | farm_id | ingredient_type | yield |
|---|---|---|---|---|
| 888 | 2016-01-01 00:00:00 | fid_111424 | ing_w | 0.0002 |
| 3187 | 2016-01-01 01:00:00 | fid_111424 | ing_w | 0.0004 |
| 5482 | 2016-01-01 02:00:00 | fid_111424 | ing_w | 0.0004 |
| 7780 | 2016-01-01 03:00:00 | fid_111424 | ing_w | 0.0004 |
| 10075 | 2016-01-01 04:00:00 | fid_111424 | ing_w | 0.0004 |
| ... | ... | ... | ... | ... |
| 20205199 | 2016-12-31 19:00:00 | fid_111424 | ing_w | 0.0004 |
| 20207562 | 2016-12-31 20:00:00 | fid_111424 | ing_w | 0.0004 |
| 20209927 | 2016-12-31 21:00:00 | fid_111424 | ing_w | 0.0004 |
| 20212293 | 2016-12-31 22:00:00 | fid_111424 | ing_w | 0.0004 |
| 20214659 | 2016-12-31 23:00:00 | fid_111424 | ing_w | 0.0004 |
8784 rows × 4 columns
fid_111424.nunique()
date               8784
farm_id               1
ingredient_type       1
yield                 2
dtype: int64
fid_111424.set_index('date',inplace=True)
ax1=fid_111424[fid_111424['ingredient_type']=='ing_w'].plot()
ax1.set_xlabel("Date")
ax1.set_ylabel("Yield")
ax1.set_title('Analysis of ingredients for low yielded farm (only ing_w)')
Text(0.5, 1.0, 'Analysis of ingredients for low yielded farm (only ing_w)')
eda2=train_data.copy()
eda2.head()
|   | date | farm_id | ingredient_type | yield |
|---|---|---|---|---|
| 0 | 2016-01-01 | fid_110884 | ing_w | 0.0 |
| 1 | 2016-01-01 | fid_90053 | ing_w | 0.0 |
| 2 | 2016-01-01 | fid_17537 | ing_w | 0.0 |
| 3 | 2016-01-01 | fid_110392 | ing_w | 0.0 |
| 4 | 2016-01-01 | fid_62402 | ing_w | 0.0 |
new=eda2[['date','ingredient_type','yield']]
new.head()
|   | date | ingredient_type | yield |
|---|---|---|---|
| 0 | 2016-01-01 | ing_w | 0.0 |
| 1 | 2016-01-01 | ing_w | 0.0 |
| 2 | 2016-01-01 | ing_w | 0.0 |
| 3 | 2016-01-01 | ing_w | 0.0 |
| 4 | 2016-01-01 | ing_w | 0.0 |
eda2=new.groupby(by=['ingredient_type','date'],as_index=False).sum()
eda2.head()
|   | ingredient_type | date | yield |
|---|---|---|---|
| 0 | ing_w | 2016-01-01 00:00:00 | 146796.9708 |
| 1 | ing_w | 2016-01-01 01:00:00 | 166439.5826 |
| 2 | ing_w | 2016-01-01 02:00:00 | 166652.5184 |
| 3 | ing_w | 2016-01-01 03:00:00 | 166988.5308 |
| 4 | ing_w | 2016-01-01 04:00:00 | 167381.4970 |
eda2_pivot=eda2.pivot(index='date',columns='ingredient_type',values='yield')
eda2_pivot.head()
| ingredient_type | ing_w | ing_x | ing_y | ing_z |
|---|---|---|---|---|
| date | ||||
| 2016-01-01 00:00:00 | 146796.9708 | 52396.5669 | 3.924435e+06 | 87655.1681 |
| 2016-01-01 01:00:00 | 166439.5826 | 59582.4064 | 3.964789e+06 | 60670.8718 |
| 2016-01-01 02:00:00 | 166652.5184 | 58646.9936 | 9.660917e+05 | 68204.1314 |
| 2016-01-01 03:00:00 | 166988.5308 | 55827.6281 | 7.249452e+05 | 61265.7329 |
| 2016-01-01 04:00:00 | 167381.4970 | 53843.3643 | 8.847084e+05 | 64722.2291 |
print(eda2_pivot.columns)
eda2_pivot.columns.name=None
Index(['ing_w', 'ing_x', 'ing_y', 'ing_z'], dtype='object', name='ingredient_type')
eda2_pivot.head()
|   | ing_w | ing_x | ing_y | ing_z |
|---|---|---|---|---|
| date | ||||
| 2016-01-01 00:00:00 | 146796.9708 | 52396.5669 | 3.924435e+06 | 87655.1681 |
| 2016-01-01 01:00:00 | 166439.5826 | 59582.4064 | 3.964789e+06 | 60670.8718 |
| 2016-01-01 02:00:00 | 166652.5184 | 58646.9936 | 9.660917e+05 | 68204.1314 |
| 2016-01-01 03:00:00 | 166988.5308 | 55827.6281 | 7.249452e+05 | 61265.7329 |
| 2016-01-01 04:00:00 | 167381.4970 | 53843.3643 | 8.847084e+05 | 64722.2291 |
eda2_pivot['Total']=eda2_pivot['ing_w']+eda2_pivot['ing_x']+eda2_pivot['ing_y']+eda2_pivot['ing_z']
eda2_pivot.head()
|   | ing_w | ing_x | ing_y | ing_z | Total |
|---|---|---|---|---|---|
| date | |||||
| 2016-01-01 00:00:00 | 146796.9708 | 52396.5669 | 3.924435e+06 | 87655.1681 | 4.211284e+06 |
| 2016-01-01 01:00:00 | 166439.5826 | 59582.4064 | 3.964789e+06 | 60670.8718 | 4.251482e+06 |
| 2016-01-01 02:00:00 | 166652.5184 | 58646.9936 | 9.660917e+05 | 68204.1314 | 1.259595e+06 |
| 2016-01-01 03:00:00 | 166988.5308 | 55827.6281 | 7.249452e+05 | 61265.7329 | 1.009027e+06 |
| 2016-01-01 04:00:00 | 167381.4970 | 53843.3643 | 8.847084e+05 | 64722.2291 | 1.170655e+06 |
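The explicit four-column addition used for `Total` can also be expressed as a row-wise `sum(axis=1)`, which stays correct if an ingredient column is added or renamed later. A sketch on toy numbers:

```python
import pandas as pd

# Toy stand-in for eda2_pivot; real values differ.
pivot = pd.DataFrame({'ing_w': [1.0, 2.0],
                      'ing_x': [3.0, 4.0],
                      'ing_y': [5.0, 6.0],
                      'ing_z': [7.0, 8.0]})

# Row-wise total over all ingredient columns.
pivot['Total'] = pivot[['ing_w', 'ing_x', 'ing_y', 'ing_z']].sum(axis=1)
print(pivot['Total'].tolist())  # [16.0, 20.0]
```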
fig, axes = plt.subplots(2,2, figsize=(10, 8))
fig.suptitle('Analysis of individual ingredients')
#For ing_w
ax1=eda2_pivot['ing_w'].plot(ax=axes[0,0])
ax1.set_ylabel("Yield")
ax1.legend(["ing_w"])
#For ing_x
ax1=eda2_pivot['ing_x'].plot(ax=axes[0,1])
ax1.set_ylabel("Yield(*e6)")
ax1.legend(["ing_x"])
#For ing_y
ax1=eda2_pivot['ing_y'].plot(ax=axes[1,0])
ax1.set_ylabel("Yield(*e7)")
ax1.legend(["ing_y"])
#For ing_z
ax1=eda2_pivot['ing_z'].plot(ax=axes[1,1])
ax1.set_ylabel("Yield")
ax1.legend(["ing_z"])
<matplotlib.legend.Legend at 0x7f508e380100>
fig, axes = plt.subplots(2,2, figsize=(10, 8))
fig.suptitle('Analysis of ingredients')
#All ingredients together
ax1=eda2_pivot.drop('Total',axis=1).plot(ax=axes[0,0],figsize=(10, 10))
ax1.set_xlabel("Date")
ax1.set_ylabel("Yield")
ax1.set_title("all ingredients")
#Total across all ingredients
ax1=eda2_pivot['Total'].plot(ax=axes[0,1],figsize=(10, 10))
ax1.set_xlabel("Date")
ax1.set_ylabel("Yield(*e6)")
ax1.set_title('Total ingredients')
axes[1,0].set_axis_off()
axes[1,1].set_axis_off()
train_data.head()
|   | date | farm_id | ingredient_type | yield |
|---|---|---|---|---|
| 0 | 2016-01-01 00:00:00 | fid_110884 | ing_w | 0.0 |
| 1 | 2016-01-01 00:00:00 | fid_90053 | ing_w | 0.0 |
| 2 | 2016-01-01 00:00:00 | fid_17537 | ing_w | 0.0 |
| 3 | 2016-01-01 00:00:00 | fid_110392 | ing_w | 0.0 |
| 4 | 2016-01-01 00:00:00 | fid_62402 | ing_w | 0.0 |
train_data.shape
(20216100, 4)
train_data.isnull().sum()
date               0
farm_id            0
ingredient_type    0
yield              0
dtype: int64
train_weather.head()
|   | timestamp | deidentified_location | temp_obs | cloudiness | wind_direction | dew_temp | pressure_sea_level | precipitation | wind_speed |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-01-01 00:00:00 | location 7369 | 25.0 | 6.0 | 0.0 | 20.0 | 1019.7 | NaN | 0.0 |
| 1 | 2016-01-01 01:00:00 | location 7369 | 24.4 | NaN | 70.0 | 21.1 | 1020.2 | -1.0 | 1.5 |
| 2 | 2016-01-01 02:00:00 | location 7369 | 22.8 | 2.0 | 0.0 | 21.1 | 1020.2 | 0.0 | 0.0 |
| 3 | 2016-01-01 03:00:00 | location 7369 | 21.1 | 2.0 | 0.0 | 20.6 | 1020.1 | 0.0 | 0.0 |
| 4 | 2016-01-01 04:00:00 | location 7369 | 20.0 | 2.0 | 250.0 | 20.0 | 1020.0 | -1.0 | 2.6 |
train_weather.shape
(139773, 9)
train_weather.isnull().sum()
timestamp                    0
deidentified_location        0
temp_obs                    55
cloudiness               69173
wind_direction            6268
dew_temp                   113
pressure_sea_level       10618
precipitation            50289
wind_speed                 304
dtype: int64
train_weather.drop('cloudiness',axis=1,inplace=True)
## dropped cloudiness because roughly half of its values are null
train_weather.rename(columns={'timestamp':'date'},inplace=True)
train_weather['date']=pd.to_datetime(train_weather['date'])
train_weather.dtypes
date                     datetime64[ns]
deidentified_location            object
temp_obs                        float64
wind_direction                  float64
dew_temp                        float64
pressure_sea_level              float64
precipitation                   float64
wind_speed                      float64
dtype: object
train_weather.set_index('date',inplace=True)
train_weather.dtypes
deidentified_location     object
temp_obs                 float64
wind_direction           float64
dew_temp                 float64
pressure_sea_level       float64
precipitation            float64
wind_speed               float64
dtype: object
train_weather.interpolate(method='time',inplace=True)
train_weather.isnull().sum()
deidentified_location    0
temp_obs                 0
wind_direction           0
dew_temp                 0
pressure_sea_level       0
precipitation            1
wind_speed               0
dtype: int64
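One precipitation value stays null because time-based interpolation needs a valid observation on each side: a NaN in the very first row has no earlier point to interpolate from, which is why the later `fillna(0)` is still needed. A minimal illustration on a toy series:

```python
import pandas as pd
import numpy as np

# Toy hourly series with a leading NaN and an interior NaN.
idx = pd.date_range('2016-01-01', periods=4, freq='h')
s = pd.Series([np.nan, 1.0, np.nan, 3.0], index=idx)

filled = s.interpolate(method='time')
print(filled.tolist())  # [nan, 1.0, 2.0, 3.0] -- the leading NaN survives
```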
train_weather.plot()
<AxesSubplot:xlabel='date'>
train_weather.reset_index(inplace=True)
train_weather.fillna(0,inplace=True)
sns.heatmap(train_weather.corr(),annot=True,cmap='Greens')
<AxesSubplot:>
train_weather['date']=pd.to_datetime(train_weather['date'],format='%Y-%m-%d %H:%M:%S')
train_weather.dtypes
date                     datetime64[ns]
deidentified_location            object
temp_obs                        float64
cloudiness                      float64
wind_direction                  float64
dew_temp                        float64
pressure_sea_level              float64
precipitation                   float64
wind_speed                      float64
dtype: object
train_weather.describe()
|   | temp_obs | cloudiness | wind_direction | dew_temp | pressure_sea_level | precipitation | wind_speed |
|---|---|---|---|---|---|---|---|
| count | 139718.000000 | 70600.000000 | 133505.000000 | 139660.000000 | 129155.000000 | 89484.000000 | 139469.000000 |
| mean | 14.418106 | 2.149306 | 180.526632 | 7.350158 | 1016.158038 | 0.983047 | 3.560527 |
| std | 10.626595 | 2.599150 | 111.523629 | 9.790235 | 7.629684 | 8.463678 | 2.335874 |
| min | -28.900000 | 0.000000 | 0.000000 | -35.000000 | 968.200000 | -1.000000 | 0.000000 |
| 25% | 7.200000 | 0.000000 | 80.000000 | 0.600000 | 1011.800000 | 0.000000 | 2.100000 |
| 50% | 15.000000 | 2.000000 | 190.000000 | 8.300000 | 1016.400000 | 0.000000 | 3.100000 |
| 75% | 22.200000 | 4.000000 | 280.000000 | 14.400000 | 1020.800000 | 0.000000 | 5.000000 |
| max | 47.200000 | 9.000000 | 360.000000 | 26.100000 | 1045.500000 | 343.000000 | 19.000000 |
farm_data.head()
|   | farm_id | operations_commencing_year | num_processing_plants | farm_area | farming_company | deidentified_location |
|---|---|---|---|---|---|---|
| 0 | fid_110884 | 2008.0 | NaN | 690.455096 | Obery Farms | location 7369 |
| 1 | fid_90053 | 2004.0 | NaN | 252.696160 | Obery Farms | location 7369 |
| 2 | fid_17537 | 1991.0 | NaN | 499.446528 | Obery Farms | location 7369 |
| 3 | fid_110392 | 2002.0 | NaN | 2200.407555 | Obery Farms | location 7369 |
| 4 | fid_62402 | 1975.0 | NaN | 10833.140121 | Obery Farms | location 7369 |
farm_data.shape
(1449, 6)
farm_data.isnull().sum()
farm_id                          0
operations_commencing_year     774
num_processing_plants         1094
farm_area                        0
farming_company                  0
deidentified_location            0
dtype: int64
farm_data.drop(['operations_commencing_year','num_processing_plants'],axis=1,inplace=True)
farm_data.shape
(1449, 4)
farm_data.nunique()
farm_id                  1434
farm_area                1397
farming_company            16
deidentified_location      16
dtype: int64
train_weather.shape
(139773, 8)
train_weather.columns
Index(['date', 'deidentified_location', 'temp_obs', 'wind_direction',
'dew_temp', 'pressure_sea_level', 'precipitation', 'wind_speed'],
dtype='object')
farm_data.shape
(1449, 4)
farmwea=train_weather.merge(farm_data,on='deidentified_location')
farmwea.shape
(12676166, 11)
farmwea.columns
Index(['date', 'deidentified_location', 'temp_obs', 'wind_direction',
'dew_temp', 'pressure_sea_level', 'precipitation', 'wind_speed',
'farm_id', 'farm_area', 'farming_company'],
dtype='object')
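The jump from ~139K weather rows to ~12.7M merged rows is expected: merging on `deidentified_location` is many-to-many, so each hourly weather row is duplicated once per farm at that location. A toy illustration (column names match the notebook; the data is made up):

```python
import pandas as pd

# Two hourly weather rows for one location.
weather = pd.DataFrame({'date': ['t1', 't2'],
                        'deidentified_location': ['loc_a', 'loc_a'],
                        'temp_obs': [20.0, 21.0]})
# Three farms at that same location.
farms = pd.DataFrame({'farm_id': ['fid_1', 'fid_2', 'fid_3'],
                      'deidentified_location': ['loc_a', 'loc_a', 'loc_a']})

merged = weather.merge(farms, on='deidentified_location')
print(len(merged))  # 2 weather rows x 3 farms = 6 rows
```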
train_data['date']=pd.to_datetime(train_data['date'])
total_data=train_data.merge(farmwea,on=['farm_id','date'])
total_data.shape
(20511532, 13)
#total_data.to_csv("/content/drive/MyDrive/Final_capstone/total_data.csv",index=False)
test_data=pd.read_csv("/content/drive/MyDrive/Final_capstone/test_data/test_data-1664552867678.csv")
test_data.shape
(20848800, 4)
test_data.isnull().sum()
date               0
farm_id            0
ingredient_type    0
id                 0
dtype: int64
test_data['date']=pd.to_datetime(test_data['date'])
test_weather=pd.read_csv("/content/drive/MyDrive/Final_capstone/test_data/test_weather-1646897984996-1664552604982.csv")
test_weather.isnull().sum()
timestamp                    0
deidentified_location        0
temp_obs                    43
cloudiness               69065
wind_direction            6134
dew_temp                   184
pressure_sea_level       10883
precipitation            47788
wind_speed                 240
dtype: int64
test_weather.drop('cloudiness',axis=1,inplace=True)
test_weather.rename(columns={'timestamp':'date'},inplace=True)
test_weather['date']=pd.to_datetime(test_weather['date'])
test_weather.set_index('date',inplace=True)
test_weather.interpolate(method='time',inplace=True)
test_weather.reset_index(inplace=True)
test_weather.isnull().sum()
date                     0
deidentified_location    0
temp_obs                 0
wind_direction           0
dew_temp                 0
pressure_sea_level       0
precipitation            1
wind_speed               0
dtype: int64
test_weather.fillna(0,inplace=True)
farm_data.head()
|   | farm_id | operations_commencing_year | num_processing_plants | farm_area | farming_company | deidentified_location |
|---|---|---|---|---|---|---|
| 0 | fid_110884 | 2008.0 | NaN | 690.455096 | Obery Farms | location 7369 |
| 1 | fid_90053 | 2004.0 | NaN | 252.696160 | Obery Farms | location 7369 |
| 2 | fid_17537 | 1991.0 | NaN | 499.446528 | Obery Farms | location 7369 |
| 3 | fid_110392 | 2002.0 | NaN | 2200.407555 | Obery Farms | location 7369 |
| 4 | fid_62402 | 1975.0 | NaN | 10833.140121 | Obery Farms | location 7369 |
farm_data1=farm_data.copy()
farm_data1.drop(['num_processing_plants','operations_commencing_year'],axis=1,inplace=True)
weafarm=test_weather.merge(farm_data1,on=['deidentified_location'])
weafarm.isnull().sum()
date                     0
deidentified_location    0
temp_obs                 0
wind_direction           0
dew_temp                 0
pressure_sea_level       0
precipitation            0
wind_speed               0
farm_id                  0
farm_area                0
farming_company          0
dtype: int64
test_merged=test_data.merge(weafarm,on=['date','farm_id'])
test_merged.to_csv("/content/drive/MyDrive/Final_capstone/test_data/test_merged.csv",index=False)
test_merged.shape
(21154797, 13)
# unique farms supplying each ingredient (nunique() is equivalent to
# len(value_counts().to_frame().reset_index()) without the intermediate frame)
ingw=test_merged[test_merged['ingredient_type']=='ing_w']['farm_id'].nunique()
ingx=test_merged[test_merged['ingredient_type']=='ing_x']['farm_id'].nunique()
ingy=test_merged[test_merged['ingredient_type']=='ing_y']['farm_id'].nunique()
ingz=test_merged[test_merged['ingredient_type']=='ing_z']['farm_id'].nunique()
x=['ing_w','ing_x','ing_y','ing_z']
y=[ingw,ingx,ingy,ingz]
ax = sns.barplot(x=x,y=y)
for p in ax.patches:
ax.annotate("%.2f" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', fontsize=11, color='gray', xytext=(0, 10),textcoords='offset points')
ax.set_xlabel("Ingredient_types")
ax.set_ylabel("No.of.processing_plants")
ax.set_title("No.of.processing_plants vs Ingredient_types")
plt.ylim(0, 1600)
plt.show()
data=train_data[train_data['farm_id']=='fid_72059'].copy()  # copy to avoid SettingWithCopyWarning on later assignments
data.isnull().sum()
data.nunique()
data['date']=pd.to_datetime(data['date'],format='%Y%m%d %H:%M:%S')
data.dtypes
data.set_index('date',inplace=True)
data=data[['ingredient_type','yield']]
data.head()
data_pivot=data.pivot(columns='ingredient_type',values='yield')
data_pivot
data_pivot.isnull().sum()
data_pivot.interpolate(method='time',inplace=True)
data_pivot.columns.name=None
new_index=pd.date_range(start='2016-01-01 00:00:00',end='2016-12-31 23:00:00',freq='H')
data_pivot1=data_pivot.reindex(new_index,fill_value=0)
data_pivot1.isnull().sum()
data_pivot1['ing_w'].plot(figsize=(25,7))
data_pivot1['ing_y'].plot(figsize=(25,7))
ing_w=data_pivot1[['ing_w']]
ing_w.head()
31*24  # 744 hours: roughly one month, used as the seasonal period below
seasonal_decompose(ing_w,period=744).plot();
#checking stationarity with the augmented Dickey-Fuller test
def adf(data):
    stat,pval,usedlag,nobs,crit_vals,icbest=adfuller(data)
    if (pval<0.05):
        return 'stationary'
    else:
        return 'nonstationary'
adf(ing_w['ing_w'])
def plots(data, lags=None):
    fig, axes = plt.subplots(2,2, figsize=(15, 8))
    fig.suptitle('Time Series Plots')
    #raw series
    ax1=data.plot(ax=axes[0,0])
    #autocorrelation
    ax1=sm.tsa.graphics.plot_acf(data, lags=lags, zero=False,ax=axes[0,1])
    #partial autocorrelation
    ax1=sm.tsa.graphics.plot_pacf(data, lags=lags, zero=False,ax=axes[1,0])
    axes[1,1].set_axis_off()
plots(ing_w)
id1_train=ing_w['2016-01':'2016-10']
id1_test=ing_w['2016-11':'2016-12']
id1_train
id_train=id1_train.copy()
id_test=id1_test.copy()
id_train
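The train/test split above relies on partial-string slicing of a DatetimeIndex: `ing_w['2016-01':'2016-10']` keeps every row whose timestamp falls in those months, inclusive on both ends. A small sketch on a synthetic hourly year (2016 is a leap year, hence 8784 hours):

```python
import pandas as pd
import numpy as np

idx = pd.date_range("2016-01-01", "2016-12-31 23:00", freq="H")
s = pd.DataFrame({"y": np.arange(len(idx))}, index=idx)

train = s["2016-01":"2016-10"]  # Jan through Oct, inclusive
test = s["2016-11":"2016-12"]   # Nov and Dec

print(len(train), len(test))  # 7320 1464
```

The counts match the 7320-row train frame and 1464-row test frame shown below.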
#ARMA Model
import statsmodels.api as sm
def arma_models(p,q,train,test):
    model = sm.tsa.ARMA(train, (p, q)).fit(trend='nc', disp=0)
    train['predictions']=model.predict(start='2016-01-01 00:00:00',end='2016-10-31 23:00:00')
    test['predictions']=model.predict(start='2016-11-01 00:00:00',end='2016-12-31 23:00:00')
    fig, axes = plt.subplots(2,2, figsize=(15, 8))
    fig.suptitle('Prediction Plots')
    #residual density on the test window
    residuals=test['ing_w']-test['predictions']
    ax1=residuals.plot(kind='kde',ax=axes[0,0])
    ax1.set_title('Resid_plot')
    #train actuals vs predictions
    ax1=train.plot(ax=axes[0,1])
    #test actuals vs predictions
    ax1=test.plot(ax=axes[1,0])
    axes[1,1].set_axis_off()
    #performance
    print('Train RMSE:',np.sqrt(mse(train['ing_w'],train['predictions'])))
    print('Test RMSE:',np.sqrt(mse(test['ing_w'],test['predictions'])))
arma_models(3,2,id_train,id_test)
#SARIMAX Model
id_train=id1_train.copy()
id_test=id1_test.copy()
id_train
| ing_w | |
|---|---|
| 2016-01-01 00:00:00 | 282.248 |
| 2016-01-01 01:00:00 | 282.248 |
| 2016-01-01 02:00:00 | 284.864 |
| 2016-01-01 03:00:00 | 280.287 |
| 2016-01-01 04:00:00 | 279.975 |
| ... | ... |
| 2016-10-31 19:00:00 | 641.319 |
| 2016-10-31 20:00:00 | 610.620 |
| 2016-10-31 21:00:00 | 545.559 |
| 2016-10-31 22:00:00 | 463.929 |
| 2016-10-31 23:00:00 | 438.585 |
7320 rows × 1 columns
def sarima_models(p,d,q,s1,s2,s3,s4,train,test):
    # fixed: the model previously fit the global id1_train instead of the train argument
    model = sm.tsa.statespace.SARIMAX(train, order=(p,d,q), seasonal_order=(s1,s2,s3,s4), trend='c').fit()
    train['predictions']=model.predict(start='2016-01-01 00:00:00',end='2016-10-31 23:00:00')
    test['predictions']=model.predict(start='2016-11-01 00:00:00',end='2016-12-31 23:00:00')
    model.plot_diagnostics(lags=12,figsize=(20,10))
    fig, axes = plt.subplots(2,2, figsize=(15, 8))
    fig.suptitle('Prediction Plots')
    #residual density on the test window
    residuals=test['ing_w']-test['predictions']
    ax1=residuals.plot(kind='kde',ax=axes[0,0])
    ax1.set_title('Resid_plot')
    #train actuals vs predictions
    ax1=train.plot(ax=axes[0,1])
    #test actuals vs predictions
    ax1=test.plot(ax=axes[1,0])
    axes[1,1].set_axis_off()
    #performance
    print('Train RMSE:',np.sqrt(mse(train['ing_w'],train['predictions'])))
    print('Test RMSE:',np.sqrt(mse(test['ing_w'],test['predictions'])))
sarima_models(3,2,3,0,1,1,4,id_train,id_test)
Train RMSE: 43.88333679551119
Test RMSE: 1470776.7691647173
#fb prophet model
id_train=id1_train.copy()
id_test=id1_test.copy()
id_train
id_train.reset_index(inplace=True)
id_test.reset_index(inplace=True)
id_train.rename(columns={'index':'ds','ing_w':'y'},inplace=True)
id_test.rename(columns={'index':'ds','ing_w':'y'},inplace=True)
id_train
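Prophet expects exactly two columns: `ds` (timestamps) and `y` (values), which is what the reset/rename dance above produces. The same conversion can be wrapped in a small helper (hypothetical name `to_prophet_frame`, not part of the original notebook):

```python
import pandas as pd
import numpy as np

def to_prophet_frame(series_df: pd.DataFrame, value_col: str) -> pd.DataFrame:
    """Convert a DatetimeIndex'ed single-column frame to Prophet's ds/y schema."""
    out = series_df[[value_col]].reset_index()
    out.columns = ["ds", "y"]
    return out

idx = pd.date_range("2016-01-01", periods=5, freq="H")
demo = pd.DataFrame({"ing_w": np.arange(5.0)}, index=idx)
prophet_in = to_prophet_frame(demo, "ing_w")
print(list(prophet_in.columns))  # ['ds', 'y']
```

Keeping the conversion in one place avoids repeating the `reset_index`/`rename` pair for every train, test, and full-series frame.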
m = Prophet()
m.fit(id_train)
id_test
forecast_train=m.predict(id_train[['ds']])
forecast_train.head()
id_train['yhat']=forecast_train['yhat']
id_train.set_index('ds',inplace=True)
id_train.plot(figsize=(15,8))
future=id_test[['ds']]
future.dtypes
ds datetime64[ns] dtype: object
forecast = m.predict(future)
print(forecast.columns)
Index(['ds', 'trend', 'yhat_lower', 'yhat_upper', 'trend_lower', 'trend_upper',
'additive_terms', 'additive_terms_lower', 'additive_terms_upper',
'daily', 'daily_lower', 'daily_upper', 'weekly', 'weekly_lower',
'weekly_upper', 'multiplicative_terms', 'multiplicative_terms_lower',
'multiplicative_terms_upper', 'yhat'],
dtype='object')
id_test['yhat']=forecast['yhat']
id_test.set_index('ds',inplace=True)
id_test
| y | yhat | |
|---|---|---|
| ds | ||
| 2016-11-01 00:00:00 | 403.316 | 462.124815 |
| 2016-11-01 01:00:00 | 396.439 | 440.862675 |
| 2016-11-01 02:00:00 | 388.837 | 429.964579 |
| 2016-11-01 03:00:00 | 382.315 | 424.267480 |
| 2016-11-01 04:00:00 | 373.147 | 420.988476 |
| ... | ... | ... |
| 2016-12-31 19:00:00 | 299.058 | 267.416923 |
| 2016-12-31 20:00:00 | 302.477 | 218.405897 |
| 2016-12-31 21:00:00 | 269.963 | 170.632402 |
| 2016-12-31 22:00:00 | 266.022 | 122.052371 |
| 2016-12-31 23:00:00 | 263.689 | 75.209165 |
1464 rows × 2 columns
id_test.plot(figsize=(15,8))
<AxesSubplot:xlabel='ds'>
#2017 forecast using FB Prophet
ingw=ing_w.copy()
ingw.reset_index(inplace=True)
ingw.head()
| ds | y | |
|---|---|---|
| 0 | 2016-01-01 00:00:00 | 282.248 |
| 1 | 2016-01-01 01:00:00 | 282.248 |
| 2 | 2016-01-01 02:00:00 | 284.864 |
| 3 | 2016-01-01 03:00:00 | 280.287 |
| 4 | 2016-01-01 04:00:00 | 279.975 |
ingw.rename(columns={'index':'ds','ing_w':'y'},inplace=True)
future_date=pd.date_range(start='2017-01-01 00:00:00',end='2017-12-31 23:00:00',freq="H")
future=pd.DataFrame(future_date,columns=['ds'])
m = Prophet()
m.fit(ingw)
forecast=m.predict(future[['ds']])[['ds','yhat']]
forecast.head()
INFO:fbprophet:Disabling yearly seasonality. Run prophet with yearly_seasonality=True to override this.
| ds | yhat | |
|---|---|---|
| 0 | 2017-01-01 00:00:00 | 81.912578 |
| 1 | 2017-01-01 01:00:00 | 56.634136 |
| 2 | 2017-01-01 02:00:00 | 42.172850 |
| 3 | 2017-01-01 03:00:00 | 33.267048 |
| 4 | 2017-01-01 04:00:00 | 27.641794 |
forecast.set_index('ds',inplace=True)
forecast.plot(figsize=(15,6))
<AxesSubplot:xlabel='ds'>
ing_w.head()
| ing_w | |
|---|---|
| 2016-01-01 00:00:00 | 282.248 |
| 2016-01-01 01:00:00 | 282.248 |
| 2016-01-01 02:00:00 | 284.864 |
| 2016-01-01 03:00:00 | 280.287 |
| 2016-01-01 04:00:00 | 279.975 |
ingw=ing_w['ing_w']
ingw
2016-01-01 00:00:00 282.248
2016-01-01 01:00:00 282.248
2016-01-01 02:00:00 284.864
2016-01-01 03:00:00 280.287
2016-01-01 04:00:00 279.975
...
2016-12-31 19:00:00 299.058
2016-12-31 20:00:00 302.477
2016-12-31 21:00:00 269.963
2016-12-31 22:00:00 266.022
2016-12-31 23:00:00 263.689
Freq: H, Name: ing_w, Length: 8784, dtype: float64
ingw.shape
(8784,)
def df_to_X_y(df, window_size=365):
    # build sliding windows of `window_size` consecutive points (here: hours, since
    # the series is hourly); the label for each window is the next point
    df_as_np = df.to_numpy()
    X = []
    y = []
    for i in range(len(df_as_np)-window_size):
        row = [[a] for a in df_as_np[i:i+window_size]]
        X.append(row)
        label = df_as_np[i+window_size]
        y.append(label)
    return np.array(X), np.array(y)
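The Python loop above builds windows one element at a time; NumPy's `sliding_window_view` produces the same arrays without the loop. A sketch (hypothetical name `df_to_X_y_fast`; equivalent output to `df_to_X_y` on a 1-D array):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def df_to_X_y_fast(values: np.ndarray, window_size: int):
    # windows[i] == values[i:i+window_size]; drop the last window because
    # it has no following point to serve as a label
    windows = sliding_window_view(values, window_size)[:-1]
    X = windows[..., np.newaxis]  # add the feature axis: (n, window, 1)
    y = values[window_size:]
    return X, y

vals = np.arange(10.0)
X, y = df_to_X_y_fast(vals, 3)
print(X.shape, y.shape)  # (7, 3, 1) (7,)
```

`sliding_window_view` returns a zero-copy view, so this also avoids materializing the nested Python lists.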
WINDOW_SIZE=365
X1, y1 = df_to_X_y(ingw, WINDOW_SIZE)
X1.shape, y1.shape
((8419, 365, 1), (8419,))
X_train1, y_train1 = X1[:7000], y1[:7000]
X_val1, y_val1 = X1[7000:7500], y1[7000:7500]
X_test1, y_test1 = X1[7500:], y1[7500:]
X_train1.shape, y_train1.shape, X_val1.shape, y_val1.shape, X_test1.shape, y_test1.shape
((7000, 365, 1), (7000,), (500, 365, 1), (500,), (919, 365, 1), (919,))
model1 = Sequential()
model1.add(InputLayer((365, 1)))
model1.add(LSTM(64))
model1.add(Dense(8, 'relu'))
model1.add(Dense(1, 'relu'))  # note: a ReLU output can die and clamp predictions at 0; 'linear' is the usual choice for regression
model1.summary()
Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
lstm_1 (LSTM) (None, 64) 16896
dense_2 (Dense) (None, 8) 520
dense_3 (Dense) (None, 1) 9
=================================================================
Total params: 17,425
Trainable params: 17,425
Non-trainable params: 0
_________________________________________________________________
cp1 = ModelCheckpoint('model1/', save_best_only=True)
model1.compile(loss=MeanSquaredError(), optimizer=Adam(learning_rate=0.0001), metrics=[RootMeanSquaredError()])
model1.fit(X_train1, y_train1, validation_data=(X_val1, y_val1), epochs=10, callbacks=[cp1])
Epoch 1/10
216/219 [============================>.] - ETA: 0s - loss: 412246.1875 - root_mean_squared_error: 642.0640
WARNING:absl:Found untraced functions such as lstm_cell_2_layer_call_fn, lstm_cell_2_layer_call_and_return_conditional_losses while saving (showing 2 of 2). These functions will not be directly callable after loading.
219/219 [==============================] - 11s 34ms/step - loss: 412365.0938 - root_mean_squared_error: 642.1566 - val_loss: 351237.7500 - val_root_mean_squared_error: 592.6531
Epoch 2/10
219/219 [==============================] - 4s 18ms/step - loss: 412365.1562 - root_mean_squared_error: 642.1566 - val_loss: 351237.7500 - val_root_mean_squared_error: 592.6531
Epoch 3/10
219/219 [==============================] - 3s 16ms/step - loss: 412365.1875 - root_mean_squared_error: 642.1567 - val_loss: 351237.7500 - val_root_mean_squared_error: 592.6531
Epoch 4/10
219/219 [==============================] - 3s 14ms/step - loss: 412365.2500 - root_mean_squared_error: 642.1567 - val_loss: 351237.7500 - val_root_mean_squared_error: 592.6531
Epoch 5/10
219/219 [==============================] - 3s 14ms/step - loss: 412365.1875 - root_mean_squared_error: 642.1567 - val_loss: 351237.7500 - val_root_mean_squared_error: 592.6531
Epoch 6/10
219/219 [==============================] - 4s 18ms/step - loss: 412365.1875 - root_mean_squared_error: 642.1567 - val_loss: 351237.7500 - val_root_mean_squared_error: 592.6531
Epoch 7/10
219/219 [==============================] - 3s 16ms/step - loss: 412365.1875 - root_mean_squared_error: 642.1567 - val_loss: 351237.7500 - val_root_mean_squared_error: 592.6531
Epoch 8/10
219/219 [==============================] - 3s 14ms/step - loss: 412365.0625 - root_mean_squared_error: 642.1566 - val_loss: 351237.7500 - val_root_mean_squared_error: 592.6531
Epoch 9/10
219/219 [==============================] - 3s 14ms/step - loss: 412365.0938 - root_mean_squared_error: 642.1566 - val_loss: 351237.7500 - val_root_mean_squared_error: 592.6531
Epoch 10/10
219/219 [==============================] - 4s 18ms/step - loss: 412365.2500 - root_mean_squared_error: 642.1567 - val_loss: 351237.7500 - val_root_mean_squared_error: 592.6531
<keras.callbacks.History at 0x7f4f201f46a0>
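The loss above barely moves across epochs and the train predictions below are all zero; a likely cause is feeding raw yields (hundreds of units) into an LSTM with a ReLU output, which can leave the output unit dead. Standardizing inputs and targets with statistics from the training split, then inverting after prediction, is the usual remedy. A minimal numpy sketch of that scaling (the Keras model itself is unchanged; `standardize` is a hypothetical helper):

```python
import numpy as np

def standardize(train: np.ndarray, *others):
    """Scale with statistics from the training split only, to avoid leakage."""
    mu, sigma = train.mean(), train.std()
    scale = lambda a: (a - mu) / sigma
    return (scale(train), *(scale(a) for a in others)), (mu, sigma)

y_train = np.array([282.2, 284.9, 280.3, 310.2])  # illustrative yield values
y_val = np.array([641.3, 610.6])
(train_s, val_s), (mu, sigma) = standardize(y_train, y_val)

# round-trip: predictions come back in the original units
restored = train_s * sigma + mu
print(np.allclose(restored, y_train))  # True
```

With scaled targets a `'linear'` output activation and a somewhat higher learning rate would typically let the loss actually decrease.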
from tensorflow.keras.models import load_model
model1 = load_model('model1/')
train_predictions = model1.predict(X_train1).flatten()
train_results = pd.DataFrame(data={'Train Predictions':train_predictions, 'Actuals':y_train1})
train_results
219/219 [==============================] - 2s 7ms/step
| Train Predictions | Actuals | |
|---|---|---|
| 0 | 0.0 | 283.234 |
| 1 | 0.0 | 287.805 |
| 2 | 0.0 | 291.718 |
| 3 | 0.0 | 291.752 |
| 4 | 0.0 | 310.203 |
| ... | ... | ... |
| 6995 | 0.0 | 954.989 |
| 6996 | 0.0 | 894.432 |
| 6997 | 0.0 | 852.906 |
| 6998 | 0.0 | 822.720 |
| 6999 | 0.0 | 761.728 |
7000 rows × 2 columns
train_results['Train Predictions'].plot()
<AxesSubplot:>
plt.plot(train_results['Train Predictions'])
plt.plot(train_results['Actuals'])
[<matplotlib.lines.Line2D at 0x7f4f1de4f700>]
def farm_id_extract(data):
    farm_id=data['farm_id'].value_counts().to_frame()
    farm_id.reset_index(inplace=True)
    ids=farm_id['index'].to_list()  # renamed from `list` to avoid shadowing the builtin
    print(len(ids))
    return ids
farms=farm_id_extract(train_data)
1434
farms1=farms[0:700]
farms2=farms[700:]
def adf(data):
    stat,pval,usedlag,nobs,crit_vals,icbest=adfuller(data)
    if (pval<0.05):
        return 'stationary'
    else:
        return 'nonstationary'
def stat_ns(data,farm_ids):
    # classify each farm's per-ingredient yield series as stationary / nonstationary
    ing_w_st=[]
    ing_x_st=[]
    ing_y_st=[]
    ing_z_st=[]
    ing_w_ns=[]
    ing_x_ns=[]
    ing_y_ns=[]
    ing_z_ns=[]
    s=0
    for farm in farm_ids:
        s=s+1
        id1=data[data['farm_id']==farm]
        id1_s1=id1.groupby(by=['date','ingredient_type'],as_index=False).agg('sum')
        id1_pivot=id1_s1.pivot(index='date',columns='ingredient_type',values='yield')
        id1_pivot.columns.name=None
        cols=id1_pivot.columns.to_list()
        if 'ing_w' in cols:
            print("status(running_id):",s)
            id1_pivot['ing_w'].fillna(0,inplace=True)
            if(adf(id1_pivot['ing_w'])=='stationary'):
                ing_w_st.append(farm)
            else:
                ing_w_ns.append(farm)
        if 'ing_x' in cols:
            id1_pivot['ing_x'].fillna(0,inplace=True)
            if(adf(id1_pivot['ing_x'])=='stationary'):
                ing_x_st.append(farm)
            else:
                ing_x_ns.append(farm)
        if 'ing_y' in cols:
            id1_pivot['ing_y'].fillna(0,inplace=True)
            if(adf(id1_pivot['ing_y'])=='stationary'):
                ing_y_st.append(farm)
            else:
                ing_y_ns.append(farm)
        if 'ing_z' in cols:
            id1_pivot['ing_z'].fillna(0,inplace=True)
            if(adf(id1_pivot['ing_z'])=='stationary'):
                ing_z_st.append(farm)
            else:
                ing_z_ns.append(farm)
    return ing_w_st,ing_w_ns,ing_x_st,ing_x_ns,ing_y_st,ing_y_ns,ing_z_st,ing_z_ns
ing_w_st1,ing_w_ns1,ing_x_st1,ing_x_ns1,ing_y_st1,ing_y_ns1,ing_z_st1,ing_z_ns1=stat_ns(train_data,farms1)
def make_same_lenlist(lst,max_size):
    # pad a list with 'NA' up to max_size so all columns share one length
    # (parameter renamed from `list` to avoid shadowing the builtin)
    if(len(lst)!=max_size):
        a=max_size-len(lst)
        for i in range(0,a):
            lst.append('NA')
    return lst
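`make_same_lenlist` pads by mutating its argument in place; `itertools.zip_longest` builds the same equal-length table in one shot without mutation. A sketch using `'NA'` as the fill value, as above (the fid values are made up):

```python
import pandas as pd
from itertools import zip_longest

cols = {"ing_w_st": ["fid_1", "fid_2", "fid_3"], "ing_w_ns": ["fid_9"]}

# zip_longest pads the shorter columns with the fill value row by row
rows = list(zip_longest(*cols.values(), fillvalue="NA"))
padded = pd.DataFrame(rows, columns=list(cols.keys()))
print(padded["ing_w_ns"].tolist())  # ['fid_9', 'NA', 'NA']
```

This also removes the need to hard-code the maximum length (700 / 734) for each set.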
set1=pd.DataFrame({'ing_w_st':make_same_lenlist(ing_w_st1,700),'ing_w_ns':make_same_lenlist(ing_w_ns1,700),'ing_x_st':make_same_lenlist(ing_x_st1,700),'ing_x_ns':make_same_lenlist(ing_x_ns1,700),'ing_y_st':make_same_lenlist(ing_y_st1,700),'ing_y_ns':make_same_lenlist(ing_y_ns1,700),'ing_z_st':make_same_lenlist(ing_z_st1,700),'ing_z_ns':make_same_lenlist(ing_z_ns1,700)})
set1.to_csv('/content/drive/MyDrive/Final_capstone/train_data/st_ns_set1.csv',index=False)
ing_w_st2,ing_w_ns2,ing_x_st2,ing_x_ns2,ing_y_st2,ing_y_ns2,ing_z_st2,ing_z_ns2=stat_ns(train_data,farms2)
set2=pd.DataFrame({'ing_w_st':make_same_lenlist(ing_w_st2,734),'ing_w_ns':make_same_lenlist(ing_w_ns2,734),'ing_x_st':make_same_lenlist(ing_x_st2,734),'ing_x_ns':make_same_lenlist(ing_x_ns2,734),'ing_y_st':make_same_lenlist(ing_y_st2,734),'ing_y_ns':make_same_lenlist(ing_y_ns2,734),'ing_z_st':make_same_lenlist(ing_z_st2,734),'ing_z_ns':make_same_lenlist(ing_z_ns2,734)})
set2.to_csv('/content/drive/MyDrive/Final_capstone/train_data/st_ns_set2.csv',index=False)
na=['NA']
set1=pd.read_csv("/content/drive/MyDrive/Final_capstone/train_data/st_ns_set1.csv",na_values=na)
set1.head()
| ing_w_st | ing_w_ns | ing_x_st | ing_x_ns | ing_y_st | ing_y_ns | ing_z_st | ing_z_ns | |
|---|---|---|---|---|---|---|---|---|
| 0 | fid_63700 | fid_82299 | fid_63700 | fid_40569 | fid_63700 | fid_122174 | fid_63700 | fid_48766 |
| 1 | fid_68761 | fid_23693 | fid_68761 | fid_91594 | fid_103446 | fid_46342 | fid_68761 | fid_33276 |
| 2 | fid_103446 | fid_91594 | fid_103446 | fid_27582 | fid_109814 | fid_27773 | fid_103446 | fid_99942 |
| 3 | fid_109814 | fid_102220 | fid_109814 | fid_94240 | fid_117459 | fid_54695 | fid_109814 | fid_69285 |
| 4 | fid_122174 | fid_40785 | fid_122174 | fid_35661 | fid_47796 | fid_118135 | fid_73431 | fid_25582 |
set2=pd.read_csv("/content/drive/MyDrive/Final_capstone/train_data/st_ns_set2.csv",na_values=na)
set2.head()
| ing_w_st | ing_w_ns | ing_x_st | ing_x_ns | ing_y_st | ing_y_ns | ing_z_st | ing_z_ns | |
|---|---|---|---|---|---|---|---|---|
| 0 | fid_71616 | fid_107436 | fid_98608 | NaN | fid_75161 | fid_20282 | NaN | NaN |
| 1 | fid_23619 | fid_107874 | fid_24451 | NaN | fid_40374 | fid_115504 | NaN | NaN |
| 2 | fid_100807 | fid_12360 | fid_109189 | NaN | fid_51284 | fid_60652 | NaN | NaN |
| 3 | fid_91133 | fid_105593 | fid_39966 | NaN | fid_93390 | NaN | NaN | NaN |
| 4 | fid_64441 | fid_76042 | fid_115974 | NaN | fid_31488 | NaN | NaN | NaN |
sets=pd.concat([set1,set2],axis=0)
sets.head()
| ing_w_st | ing_w_ns | ing_x_st | ing_x_ns | ing_y_st | ing_y_ns | ing_z_st | ing_z_ns | |
|---|---|---|---|---|---|---|---|---|
| 0 | fid_63700 | fid_82299 | fid_63700 | fid_40569 | fid_63700 | fid_122174 | fid_63700 | fid_48766 |
| 1 | fid_68761 | fid_23693 | fid_68761 | fid_91594 | fid_103446 | fid_46342 | fid_68761 | fid_33276 |
| 2 | fid_103446 | fid_91594 | fid_103446 | fid_27582 | fid_109814 | fid_27773 | fid_103446 | fid_99942 |
| 3 | fid_109814 | fid_102220 | fid_109814 | fid_94240 | fid_117459 | fid_54695 | fid_109814 | fid_69285 |
| 4 | fid_122174 | fid_40785 | fid_122174 | fid_35661 | fid_47796 | fid_118135 | fid_73431 | fid_25582 |
ing_z_st_list=sets['ing_z_ns'].value_counts().to_frame().reset_index()['index'].to_list()  # note: despite the _st_ name, this list is built from the 'ing_z_ns' column
df=pd.DataFrame({'Model_Name':[],'Farm_id':[],'Params_p':[],'Params_q':[],'Train_error':[],'Test_error':[]})
for i in range(0,4):
    fid=random.choices(ing_z_st_list, k=1)
    fid=fid[0]
    data1=train_data[(train_data['farm_id']==fid)&(train_data['ingredient_type']=='ing_z')]
    data1=data1.groupby(by=['date','ingredient_type'],as_index=False).sum()
    data1_pivot=data1.pivot(index='date',columns='ingredient_type',values='yield')
    data1_pivot.columns.name=None
    new_date=pd.date_range('2016-01-01 00:00:00','2016-12-31 23:00:00',freq='H')
    data2=data1_pivot.reindex(new_date,fill_value=0)
    data2.fillna(0,inplace=True)
    ingw=data2[['ing_z']]
    train=ingw['2016-01':'2016-10']
    test=ingw['2016-11':'2016-12']
    for j in range(0,3):  # three random (p,q) draws per sampled farm
        df1=pd.DataFrame({'Model_Name':[],'Farm_id':[],'Params_p':[],'Params_q':[],'Train_error':[],'Test_error':[]})
        p=random.randint(1,4)
        q=random.randint(1,4)
        model = sm.tsa.ARMA(train['ing_z'], (p, q)).fit(trend='nc', disp=0)  # fixed: column is 'ing_z', not 'ing_w'
        train['predictions']=model.predict(start='2016-01-01 00:00:00',end='2016-10-31 23:00:00')
        test['predictions']=model.predict(start='2016-11-01 00:00:00',end='2016-12-31 23:00:00')
        tr_er=np.sqrt(mse(train['ing_z'],train['predictions']))
        te_er=np.sqrt(mse(test['ing_z'],test['predictions']))
        df1['Model_Name']=['ARMA']
        df1['Farm_id']=[fid]
        df1['Params_p']=[p]
        df1['Params_q']=[q]
        df1['Train_error']=tr_er
        df1['Test_error']=te_er
        df=pd.concat([df,df1],axis=0)
df
| Model_Name | Farm_id | Params_p | Params_q | Train_error | Test_error | |
|---|---|---|---|---|---|---|
| 0 | ARMA | fid_63064 | 1.0 | 4.0 | 1.152278 | 12.522500 |
| 0 | ARMA | fid_63064 | 3.0 | 3.0 | 1.145259 | 7.954494 |
| 0 | ARMA | fid_63064 | 3.0 | 2.0 | 1.152273 | 12.522528 |
| 0 | ARMA | fid_63064 | 2.0 | 3.0 | 1.152273 | 12.522530 |
| 0 | ARMA | fid_63064 | 1.0 | 1.0 | 1.152283 | 12.522500 |
| 0 | ARMA | fid_71919 | 4.0 | 3.0 | 0.930148 | 1.907023 |
| 0 | ARMA | fid_71919 | 2.0 | 4.0 | 0.932010 | 1.583472 |
| 0 | ARMA | fid_71919 | 2.0 | 4.0 | 0.932010 | 1.583472 |
| 0 | ARMA | fid_71919 | 2.0 | 4.0 | 0.932010 | 1.583472 |
| 0 | ARMA | fid_71919 | 3.0 | 4.0 | 0.932000 | 1.586006 |
| 0 | ARMA | fid_80428 | 2.0 | 2.0 | 7.999145 | 75.429051 |
| 0 | ARMA | fid_80428 | 2.0 | 2.0 | 7.999145 | 75.429051 |
| 0 | ARMA | fid_80428 | 1.0 | 1.0 | 8.107576 | 72.925692 |
| 0 | ARMA | fid_80428 | 1.0 | 2.0 | 8.009224 | 74.721480 |
| 0 | ARMA | fid_80428 | 1.0 | 2.0 | 8.009224 | 74.721480 |
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
# Generate the data to plot
x = df['Params_p']
y = df['Params_q']
z = df['Test_error']-df['Train_error']
# Create a 3D plot
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(x, y, z)
# Set the axis labels
ax.set_xlabel('Params_p')
ax.set_ylabel('Params_q')
ax.set_zlabel('Test_error - Train_error')
# Show the plot
plt.show()
train_data['date']=pd.to_datetime(train_data['date'],format='%Y%m%d %H:%M:%S')
df=pd.DataFrame({'Model_Name':[],'Farm_id':[],'Params_p':[],'Params_d':[],'Params_q':[],'Train_error':[],'Test_error':[]})
for i in range(0,4):
    fid=random.choices(ing_z_st_list, k=1)
    fid=fid[0]
    data1=train_data[(train_data['farm_id']==fid)&(train_data['ingredient_type']=='ing_z')]
    data1=data1.groupby(by=['date','ingredient_type'],as_index=False).sum()
    data1_pivot=data1.pivot(index='date',columns='ingredient_type',values='yield')
    data1_pivot.columns.name=None
    new_date=pd.date_range('2016-01-01 00:00:00','2016-12-31 23:00:00',freq='H')
    data2=data1_pivot.reindex(new_date,fill_value=0)
    data2.fillna(0,inplace=True)
    ingw=data2[['ing_z']]
    train=ingw['2016-01':'2016-10']
    test=ingw['2016-11':'2016-12']
    for j in range(0,3):  # three random (p,d,q) draws per sampled farm
        df1=pd.DataFrame({'Model_Name':[],'Farm_id':[],'Params_p':[],'Params_d':[],'Params_q':[],'Train_error':[],'Test_error':[]})
        p=random.randint(1,4)
        q=random.randint(1,4)
        d=random.randint(1,2)
        model = sm.tsa.statespace.SARIMAX(train['ing_z'], order=(p,d,q), seasonal_order=(1, 1, 1, 6),initialization='approximate_diffuse').fit()
        train['predictions']=model.predict(start='2016-01-01 00:00:00',end='2016-10-31 23:00:00')
        test['predictions']=model.predict(start='2016-11-01 00:00:00',end='2016-12-31 23:00:00')
        tr_er=np.sqrt(mse(train['ing_z'],train['predictions']))
        te_er=np.sqrt(mse(test['ing_z'],test['predictions']))
        df1['Model_Name']=['SARIMAX']
        df1['Farm_id']=[fid]
        df1['Params_p']=[p]
        df1['Params_d']=[d]
        df1['Params_q']=[q]
        df1['Train_error']=tr_er
        df1['Test_error']=te_er
        df=pd.concat([df,df1],axis=0)
df['Error']=df['Test_error']-df['Train_error']
df.sort_values(by='Error',inplace=True)
df
| Model_Name | Farm_id | Params_p | Params_d | Params_q | Train_error | Test_error | Error | |
|---|---|---|---|---|---|---|---|---|
| 0 | ARMA | fid_78330 | 3.0 | 2.0 | 1.0 | 0.109978 | 0.373523 | 0.263545 |
| 0 | ARMA | fid_78330 | 1.0 | 2.0 | 4.0 | 0.109772 | 0.389416 | 0.279644 |
| 0 | ARMA | fid_78330 | 4.0 | 2.0 | 2.0 | 0.109923 | 0.406197 | 0.296274 |
| 0 | ARMA | fid_78330 | 3.0 | 2.0 | 2.0 | 0.109962 | 0.433501 | 0.323539 |
| 0 | ARMA | fid_78330 | 1.0 | 2.0 | 2.0 | 0.114799 | 0.499249 | 0.384450 |
| 0 | ARMA | fid_78330 | 4.0 | 1.0 | 3.0 | 0.108824 | 0.564695 | 0.455870 |
| 0 | ARMA | fid_78330 | 2.0 | 1.0 | 1.0 | 0.109238 | 0.585253 | 0.476015 |
| 0 | ARMA | fid_78330 | 4.0 | 1.0 | 4.0 | 0.108633 | 0.616994 | 0.508361 |
| 0 | ARMA | fid_64015 | 4.0 | 2.0 | 2.0 | 2.490549 | 6.550633 | 4.060084 |
| 0 | ARMA | fid_64015 | 2.0 | 1.0 | 3.0 | 2.400612 | 6.772879 | 4.372267 |
| 0 | ARMA | fid_64015 | 1.0 | 1.0 | 4.0 | 2.400548 | 6.786533 | 4.385985 |
| 0 | ARMA | fid_64015 | 3.0 | 1.0 | 3.0 | 2.373737 | 6.946760 | 4.573023 |
| 0 | ARMA | fid_64709 | 1.0 | 2.0 | 4.0 | 2.299792 | 8.302391 | 6.002599 |
| 0 | ARMA | fid_64709 | 2.0 | 1.0 | 2.0 | 2.253893 | 9.231946 | 6.978053 |
| 0 | ARMA | fid_64709 | 1.0 | 1.0 | 4.0 | 2.268672 | 10.043716 | 7.775044 |
| 0 | ARMA | fid_64709 | 3.0 | 1.0 | 4.0 | 2.266379 | 11.028489 | 8.762110 |
| 0 | ARMA | fid_64709 | 1.0 | 1.0 | 2.0 | 2.279733 | 11.801799 | 9.522066 |
| 0 | ARMA | fid_64709 | 4.0 | 1.0 | 4.0 | 2.212300 | 11.774733 | 9.562433 |
| 0 | ARMA | fid_64709 | 2.0 | 2.0 | 1.0 | 2.295991 | 12.318157 | 10.022166 |
| 0 | ARMA | fid_64709 | 4.0 | 1.0 | 1.0 | 2.157667 | 13.880594 | 11.722927 |
| 0 | ARMA | fid_68792 | 2.0 | 1.0 | 3.0 | 34.778096 | 141.721553 | 106.943457 |
| 0 | ARMA | fid_68792 | 3.0 | 1.0 | 2.0 | 34.687921 | 143.718361 | 109.030440 |
| 0 | ARMA | fid_68792 | 3.0 | 1.0 | 2.0 | 34.687921 | 143.718361 | 109.030440 |
| 0 | ARMA | fid_68792 | 4.0 | 1.0 | 2.0 | 35.830684 | 161.467075 | 125.636391 |
| 0 | ARMA | fid_68792 | 3.0 | 2.0 | 2.0 | 36.219623 | 242.819826 | 206.600203 |
| 0 | ARMA | fid_68792 | 3.0 | 2.0 | 4.0 | 36.815094 | 244.695977 | 207.880884 |
| 0 | ARMA | fid_68792 | 1.0 | 2.0 | 1.0 | 36.521020 | 656.774705 | 620.253685 |
| 0 | ARMA | fid_68792 | 4.0 | 2.0 | 1.0 | 36.204902 | 673.163597 | 636.958695 |
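The loops above rebuild a one-row DataFrame per trial and concatenate repeatedly; collecting plain dicts and building the frame once is simpler and yields a clean integer index. A sketch of that bookkeeping (placeholder error values stand in for the SARIMAX fit):

```python
import random
import pandas as pd

random.seed(0)
records = []
for _ in range(3):
    p, d, q = random.randint(1, 4), random.randint(1, 2), random.randint(1, 4)
    # ... fit the model with (p, d, q) here; placeholder errors below ...
    records.append({"Params_p": p, "Params_d": d, "Params_q": q,
                    "Train_error": 1.0, "Test_error": 2.0})

results = pd.DataFrame.from_records(records)
print(results.shape)  # (3, 5)
```

Repeated `pd.concat` inside a loop is quadratic in the number of trials; `from_records` at the end is linear and avoids the all-zero index seen in the tables above.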
def best_params_model(model_name,farm_id,data,train,test,params):
    fig, axes = plt.subplots(2,2, figsize=(15, 8))
    fig.suptitle('Time series Plots')
    print("---------------Time_Series_Plots_For {}-------------".format(farm_id))
    #raw series
    ax1=data.plot(ax=axes[0,0])
    ax1.set_title('data')
    #autocorrelation
    ax1=sm.tsa.graphics.plot_acf(data,ax=axes[0,1],zero=False)
    #partial autocorrelation
    ax1=sm.tsa.graphics.plot_pacf(data,ax=axes[1,0],zero=False)
    axes[1,1].set_axis_off()
    plt.show()
    print('-------------------Prediction_plots_For {}-------------------'.format(farm_id))
    fig, axes = plt.subplots(2,2, figsize=(15, 8))
    fig.suptitle('Prediction_plots')
    #train actuals vs predictions
    ax1=train.plot(ax=axes[0,0])
    ax1.set_title('Train_data')
    #test actuals vs predictions
    ax1=test.plot(ax=axes[0,1])
    ax1.set_title('Test_data')
    tr_er=np.sqrt(mse(train['ing_w'],train['predictions']))
    te_er=np.sqrt(mse(test['ing_w'],test['predictions']))
    error=[tr_er,te_er]
    #a residual plot could be added in the spare panel
    x_axis=['Train_error','Test_error']
    sns.barplot(ax=axes[1,0],x=x_axis,y=error)
    axes[1,1].set_axis_off()
    plt.show()
df=pd.DataFrame({'ingredient_type':[],'stationary type':[],'model_type':[],'parameters':[]})
df['ingredient_type']=['ing_w','ing_w','ing_x','ing_x','ing_y','ing_y','ing_z','ing_z']
df['model_type']='SARIMAX'
df['stationary type']=['stationary','Non stationary','stationary','Non stationary','stationary','Non stationary','stationary','Non stationary']
df['parameters']=['(p,d,q)=(3,1,3)','(p,d,q)=(3,4,4)','(p,d,q)=(2,1,4)','(p,d,q)=(3,4,4)','(p,d,q)=(3,1,3)','(p,d,q)=(3,4,4)','(p,d,q)=(4,1,2)','(p,d,q)=(3,2,1)']
df
| ingredient_type | stationary type | model_type | parameters | |
|---|---|---|---|---|
| 0 | ing_w | stationary | SARIMAX | (p,d,q)=(3,1,3) |
| 1 | ing_w | Non stationary | SARIMAX | (p,d,q)=(3,4,4) |
| 2 | ing_x | stationary | SARIMAX | (p,d,q)=(2,1,4) |
| 3 | ing_x | Non stationary | SARIMAX | (p,d,q)=(3,4,4) |
| 4 | ing_y | stationary | SARIMAX | (p,d,q)=(3,1,3) |
| 5 | ing_y | Non stationary | SARIMAX | (p,d,q)=(3,4,4) |
| 6 | ing_z | stationary | SARIMAX | (p,d,q)=(4,1,2) |
| 7 | ing_z | Non stationary | SARIMAX | (p,d,q)=(3,2,1) |
x=df[['Params_p', 'Params_d','Params_q']]
x.shape
y=df['Test_error']-df['Train_error']
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
regressor = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
regressor.fit(x_train,y_train)
RandomForestRegressor(max_depth=10, random_state=42)
train_preds=regressor.predict(x_train)
test_preds=regressor.predict(x_test)
train_preds
array([ 23.65220299,  20.84756862,  19.89609658, 118.93602099,
        52.2463295 , 484.17530705,  93.85038606, 170.76387559,
       176.41661862,   7.9447807 ,  21.06533915, 424.73651662,
         7.29654588,  29.62921498, 199.6675215 ,  40.1111367 ,
        42.7607878 ,  91.10943009,  51.16853619,   7.9447807 ,
         7.29654588,  40.1111367 ])
y_train
0      6.978053
0      9.522066
0      0.455870
0      0.296274
0      0.476015
0    636.958695
0    125.636391
0    207.880884
0    206.600203
0      9.562433
0      8.762110
0    620.253685
0      4.385985
0      4.573023
0     10.022166
0      4.372267
0      6.002599
0    109.030440
0      0.384450
0      0.508361
0      7.775044
0    106.943457
dtype: float64
y_test
0      0.263545
0      4.060084
0      0.279644
0     11.722927
0    109.030440
0      0.323539
dtype: float64
test_preds
array([265.63244013, 118.93602099,  42.7607878 , 266.64120366,
        91.10943009, 176.41661862])
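The random forest here acts as a meta-model: given an ARIMA parameter triple `(p, d, q)`, it predicts the train/test error gap. With only ~22 training rows the fit is fragile, but the API pattern is straightforward. A toy sketch on synthetic triples (the target relation below is invented for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.integers(1, 5, size=(40, 3)).astype(float)  # toy (p, d, q) triples
y = X[:, 0] + 2 * X[:, 2] + rng.normal(scale=0.1, size=40)  # invented target

rf = RandomForestRegressor(n_estimators=50, max_depth=5, random_state=0)
rf.fit(X, y)
preds = rf.predict(X[:5])
print(preds.shape)  # (5,)
```

With so few real samples, cross-validation (e.g. `cross_val_score`) would give a more honest picture than the single train/test split above.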
len(ing_y_st_list)
300
fid=ing_y_st_list[0:100]
len(fid)
100
train_data=pd.read_csv('/content/drive/MyDrive/Final_capstone/train_data/train_data.csv')
train_data['date']=pd.to_datetime(train_data['date'],format='%Y%m%d %H:%M:%S')
%%time
df=pd.DataFrame({'date':[],'predictions':[],'farm_id':[],'ingredient_type':[]})
s=0
for i in fid:
    data1=train_data[train_data['farm_id']==i][['date','ingredient_type','yield']]
    #date conversion done above
    data1=data1.groupby(by=['date','ingredient_type'],as_index=False).sum()
    data1_pivot=data1.pivot(index='date',columns='ingredient_type',values='yield')
    data1_pivot.columns.name=None
    new_date=pd.date_range('2016-01-01 00:00:00','2016-12-31 23:00:00',freq='H')
    data2=data1_pivot.reindex(new_date,fill_value=0)
    data2.fillna(0,inplace=True)
    ingy=data2[['ing_y']]
    model = sm.tsa.statespace.SARIMAX(ingy, order=(3,1,4), seasonal_order=(1, 1, 1, 8),initialization='approximate_diffuse').fit()
    preds=model.predict(start='2017-01-01 00:00:00',end='2017-12-31 23:00:00')
    df1=preds.to_frame()
    df1['farm_id']=i
    df1.reset_index(inplace=True)
    df1.rename(columns={'index':'date','predicted_mean':'predictions'},inplace=True)
    df1['ingredient_type']='ing_y'
    df=pd.concat([df,df1],axis=0)
    s=s+1
    print("model {} successful".format(s))
model 1 successful
model 2 successful
...
model 100 successful
CPU times: user 2h 1min 1s, sys: 3min 43s, total: 2h 4min 44s
Wall time: 2h 2min 39s
df
| date | predictions | farm_id | ingredient_type | |
|---|---|---|---|---|
| 0 | 2017-01-01 00:00:00 | 54.119342 | fid_63700 | ing_y |
| 1 | 2017-01-01 01:00:00 | 57.542113 | fid_63700 | ing_y |
| 2 | 2017-01-01 02:00:00 | 52.571918 | fid_63700 | ing_y |
| 3 | 2017-01-01 03:00:00 | 51.281324 | fid_63700 | ing_y |
| 4 | 2017-01-01 04:00:00 | 52.237634 | fid_63700 | ing_y |
| ... | ... | ... | ... | ... |
| 8755 | 2017-12-31 19:00:00 | 239.374251 | fid_51284 | ing_y |
| 8756 | 2017-12-31 20:00:00 | 237.397279 | fid_51284 | ing_y |
| 8757 | 2017-12-31 21:00:00 | 243.784860 | fid_51284 | ing_y |
| 8758 | 2017-12-31 22:00:00 | 243.875725 | fid_51284 | ing_y |
| 8759 | 2017-12-31 23:00:00 | 235.446901 | fid_51284 | ing_y |
876000 rows × 4 columns
df[df['predictions']<0]
| date | predictions | farm_id | ingredient_type | |
|---|---|---|---|---|
| 197 | 2017-01-09 05:00:00 | -0.058046 | fid_40771 | ing_y |
| 205 | 2017-01-09 13:00:00 | -0.146759 | fid_40771 | ing_y |
| 213 | 2017-01-09 21:00:00 | -0.235472 | fid_40771 | ing_y |
| 221 | 2017-01-10 05:00:00 | -0.324185 | fid_40771 | ing_y |
| 229 | 2017-01-10 13:00:00 | -0.412898 | fid_40771 | ing_y |
| ... | ... | ... | ... | ... |
| 8755 | 2017-12-31 19:00:00 | -3077.201929 | fid_15805 | ing_y |
| 8756 | 2017-12-31 20:00:00 | -3084.137428 | fid_15805 | ing_y |
| 8757 | 2017-12-31 21:00:00 | -3060.701861 | fid_15805 | ing_y |
| 8758 | 2017-12-31 22:00:00 | -2993.025073 | fid_15805 | ing_y |
| 8759 | 2017-12-31 23:00:00 | -2841.740191 | fid_15805 | ing_y |
91969 rows × 4 columns
df.to_csv("/content/drive/MyDrive/Final_capstone/Forecast_2017_Sarimax/ing_y_st/ing_y_st_negvals/1to100farmid.csv",index=False)
df
| date | predictions | farm_id | ingredient_type | |
|---|---|---|---|---|
| 0 | 2017-01-01 00:00:00 | 54.119342 | fid_63700 | ing_y |
| 1 | 2017-01-01 01:00:00 | 57.542113 | fid_63700 | ing_y |
| 2 | 2017-01-01 02:00:00 | 52.571918 | fid_63700 | ing_y |
| 3 | 2017-01-01 03:00:00 | 51.281324 | fid_63700 | ing_y |
| 4 | 2017-01-01 04:00:00 | 52.237634 | fid_63700 | ing_y |
| ... | ... | ... | ... | ... |
| 8755 | 2017-12-31 19:00:00 | 239.374251 | fid_51284 | ing_y |
| 8756 | 2017-12-31 20:00:00 | 237.397279 | fid_51284 | ing_y |
| 8757 | 2017-12-31 21:00:00 | 243.784860 | fid_51284 | ing_y |
| 8758 | 2017-12-31 22:00:00 | 243.875725 | fid_51284 | ing_y |
| 8759 | 2017-12-31 23:00:00 | 235.446901 | fid_51284 | ing_y |
876000 rows × 4 columns
df1=df.copy()
df1['predictions']=df1['predictions'].apply(lambda x: 0 if x<0 else x)
df1
| date | predictions | farm_id | ingredient_type | |
|---|---|---|---|---|
| 0 | 2017-01-01 00:00:00 | 54.119342 | fid_63700 | ing_y |
| 1 | 2017-01-01 01:00:00 | 57.542113 | fid_63700 | ing_y |
| 2 | 2017-01-01 02:00:00 | 52.571918 | fid_63700 | ing_y |
| 3 | 2017-01-01 03:00:00 | 51.281324 | fid_63700 | ing_y |
| 4 | 2017-01-01 04:00:00 | 52.237634 | fid_63700 | ing_y |
| ... | ... | ... | ... | ... |
| 8755 | 2017-12-31 19:00:00 | 239.374251 | fid_51284 | ing_y |
| 8756 | 2017-12-31 20:00:00 | 237.397279 | fid_51284 | ing_y |
| 8757 | 2017-12-31 21:00:00 | 243.784860 | fid_51284 | ing_y |
| 8758 | 2017-12-31 22:00:00 | 243.875725 | fid_51284 | ing_y |
| 8759 | 2017-12-31 23:00:00 | 235.446901 | fid_51284 | ing_y |
876000 rows × 4 columns
df1[df1['predictions']<0]
| date | predictions | farm_id | ingredient_type |
|---|---|---|---|
0 rows × 4 columns (no negative predictions remain)
df1.to_csv("/content/drive/MyDrive/Final_capstone/Forecast_2017_Sarimax/ing_y_st/ing_y_st_noneg_Vals/1to100farmid.csv",index=False)
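The element-wise `apply`/lambda above works, but pandas has a vectorized equivalent that is much faster on 876k rows. A minimal sketch on a toy frame (values made up):

```python
import pandas as pd

# toy stand-in for the full predictions DataFrame
df1 = pd.DataFrame({"predictions": [54.119342, -0.058046, 239.374251, -3077.201929]})

# clip(lower=0) floors every negative prediction at zero in one vectorized call,
# giving the same result as apply(lambda x: 0 if x < 0 else x)
df1["predictions"] = df1["predictions"].clip(lower=0)
```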
ing_w_st=pd.read_csv("/content/drive/MyDrive/Final_capstone/Forecast_2017_Sarimax/ing_w_st/ingw_total_preds.csv")
ing_x_st=pd.read_csv("/content/drive/MyDrive/Final_capstone/Forecast_2017_Sarimax/ing_x_st/ing_x__total.csv")
ing_y_st=pd.read_csv("/content/drive/MyDrive/Final_capstone/Forecast_2017_Sarimax/ing_y_st/ingy_total_preds.csv")
ing_z_st=pd.read_csv("/content/drive/MyDrive/Final_capstone/Forecast_2017_Sarimax/ing_z_st/ingz_total_preds.csv")
ing_all_ns=pd.read_csv("/content/drive/MyDrive/Final_capstone/Forecast_2017_Sarimax/ing_all_ns/allingwxyz_ns.csv")
ing_ns_all=ing_all_ns[['date','predictions','farm_id','ingredient_type']]
preds=pd.concat([ing_w_st,ing_x_st,ing_y_st,ing_z_st,ing_ns_all],axis=0)
preds_list=preds['farm_id'].value_counts().to_frame().reset_index()['index'].to_list()
test_data=pd.read_csv("/content/drive/MyDrive/Final_capstone/test_data/test_data-1664552867678.csv")
test_data.head()
test_list=test_data['farm_id'].value_counts().to_frame().reset_index()['index'].to_list()
total_merge=test_data.merge(preds,on=['date','farm_id','ingredient_type'])
total_merge.shape
total_merge.head()
total_merge.sort_values(by='id',inplace=True)
total_merge.head()
sample_sub=pd.read_csv("/content/drive/MyDrive/Final_capstone/sample_submission-1646898326432-1664166963268.csv")
sample_sub.head()
sub=total_merge[['id','predictions']].copy()  #.copy() avoids a SettingWithCopyWarning on the rename below
sub.rename(columns={'predictions':'yield'},inplace=True)
sub.shape
sub.to_csv("/content/drive/MyDrive/Final_capstone/Forecast_2017_Sarimax/4632_submission1.csv",index=False)
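The merge-and-submit steps above reduce to a keyed inner join followed by a sort on the submission id. A sketch on toy frames (column names follow the notebook; the values are made up):

```python
import pandas as pd

# toy stand-ins for test_data and the concatenated predictions frame
test_data = pd.DataFrame({
    "id": [2, 1],
    "date": ["2017-01-01 00:00:00", "2017-01-01 01:00:00"],
    "farm_id": ["fid_63700", "fid_63700"],
    "ingredient_type": ["ing_y", "ing_y"],
})
preds = pd.DataFrame({
    "date": ["2017-01-01 00:00:00", "2017-01-01 01:00:00"],
    "predictions": [54.12, 57.54],
    "farm_id": ["fid_63700", "fid_63700"],
    "ingredient_type": ["ing_y", "ing_y"],
})

# inner join on the three shared keys, then restore submission order by id
merged = test_data.merge(preds, on=["date", "farm_id", "ingredient_type"])
merged = merged.sort_values("id")
sub = merged[["id", "predictions"]].rename(columns={"predictions": "yield"})
```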
!pip install statsforecast
!pip install statsmodels
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: statsforecast in /usr/local/lib/python3.8/dist-packages (1.4.0)
Requirement already satisfied: statsmodels in /usr/local/lib/python3.8/dist-packages (0.13.5)
(remaining "Requirement already satisfied" lines for numba, matplotlib, plotly, pandas, tqdm, scipy, numpy, patsy and their transitive dependencies omitted)
## Loading packages
from google.colab import drive
drive.mount('/content/drive')
from statsforecast.core import StatsForecast
from statsforecast.models import (ADIDA,CrostonClassic)
#from fbprophet import Prophet
#from fancyimpute import KNN
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
farm_data
| farm_id | operations_commencing_year | num_processing_plants | farm_area | farming_company | deidentified_location | |
|---|---|---|---|---|---|---|
| 0 | fid_110884 | 2008.0 | NaN | 690.455096 | Obery Farms | location 7369 |
| 1 | fid_90053 | 2004.0 | NaN | 252.696160 | Obery Farms | location 7369 |
| 2 | fid_17537 | 1991.0 | NaN | 499.446528 | Obery Farms | location 7369 |
| 3 | fid_110392 | 2002.0 | NaN | 2200.407555 | Obery Farms | location 7369 |
| 4 | fid_62402 | 1975.0 | NaN | 10833.140121 | Obery Farms | location 7369 |
| ... | ... | ... | ... | ... | ... | ... |
| 1444 | fid_41664 | 1914.0 | NaN | 1822.663957 | Sanderson Farms | location 5150 |
| 1445 | fid_14529 | NaN | NaN | 399.297094 | Obery Farms | location 5150 |
| 1446 | fid_46454 | 1997.0 | NaN | 1046.552295 | Sanderson Farms | location 5150 |
| 1447 | fid_24103 | 2001.0 | NaN | 2766.186825 | Dole Food Company | location 5150 |
| 1448 | fid_34139 | 2001.0 | NaN | 8572.252713 | Wayne Farms | location 5150 |
1449 rows × 6 columns
repe=farm_data['farm_id'].value_counts().to_frame().head(15).reset_index()['index'].to_list()
## 15 farm_ids appear twice, i.e. the same farm_id is listed with two different farming_company / deidentified_location values
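A count-based filter would pick up the duplicated ids without hard-coding `head(15)`; a sketch on toy data (ids and companies are illustrative):

```python
import pandas as pd

# toy farm table: two ids appear twice under different companies
farm_data = pd.DataFrame({
    "farm_id": ["fid_29387", "fid_29387", "fid_73431", "fid_73431", "fid_90053"],
    "farming_company": ["Wayne Farms", "Obery Farms", "Wayne Farms", "Obery Farms", "Obery Farms"],
})

# keep only ids whose count exceeds 1, however many there happen to be
counts = farm_data["farm_id"].value_counts()
repe = counts[counts > 1].index.to_list()
```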
cols=total_data.columns.to_list()
df=pd.DataFrame(columns=cols)
for i in repe:
    df1=total_data[total_data['farm_id']==i]
    df=pd.concat([df,df1],axis=0)
df
| date | farm_id | ingredient_type | yield | deidentified_location | temp_obs | wind_direction | dew_temp | pressure_sea_level | precipitation | wind_speed | farm_area | farming_company | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 749 | 2016-01-01 00:00:00 | fid_29387 | ing_w | 0.1625 | location 7048 | 1.9 | 200.0 | -1.2 | 1016.2 | -1.0 | 5.0 | 2590.971767 | Wayne Farms |
| 750 | 2016-01-01 00:00:00 | fid_29387 | ing_w | 131.4450 | location 7048 | 1.9 | 200.0 | -1.2 | 1016.2 | -1.0 | 5.0 | 2590.971767 | Wayne Farms |
| 2031 | 2016-01-01 01:00:00 | fid_29387 | ing_w | 0.1525 | location 5833 | 9.4 | 360.0 | -2.2 | 1021.4 | 0.0 | 3.1 | 371.612000 | Obery Farms |
| 2032 | 2016-01-01 01:00:00 | fid_29387 | ing_w | 0.1525 | location 7048 | 1.6 | 190.0 | -1.5 | 1016.3 | 0.0 | 3.0 | 2590.971767 | Wayne Farms |
| 2033 | 2016-01-01 01:00:00 | fid_29387 | ing_w | 132.0550 | location 5833 | 9.4 | 360.0 | -2.2 | 1021.4 | 0.0 | 3.1 | 371.612000 | Obery Farms |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20509364 | 2016-12-31 23:00:00 | fid_68761 | ing_x | 11.4649 | location 5489 | 16.7 | 250.0 | 10.0 | 1011.0 | 0.0 | 2.6 | 3174.588413 | Obery Farms |
| 20509365 | 2016-12-31 23:00:00 | fid_68761 | ing_w | 13.9300 | location 5489 | 16.7 | 250.0 | 10.0 | 1011.0 | 0.0 | 2.6 | 2173.186976 | Wayne Farms |
| 20509366 | 2016-12-31 23:00:00 | fid_68761 | ing_w | 13.9300 | location 5489 | 16.7 | 250.0 | 10.0 | 1011.0 | 0.0 | 2.6 | 3174.588413 | Obery Farms |
| 20509367 | 2016-12-31 23:00:00 | fid_68761 | ing_z | 5.8614 | location 5489 | 16.7 | 250.0 | 10.0 | 1011.0 | 0.0 | 2.6 | 2173.186976 | Wayne Farms |
| 20509368 | 2016-12-31 23:00:00 | fid_68761 | ing_z | 5.8614 | location 5489 | 16.7 | 250.0 | 10.0 | 1011.0 | 0.0 | 2.6 | 3174.588413 | Obery Farms |
771408 rows × 13 columns
#total_data=train_data.merge(farm_wea,on=['farm_id','date'])
#total_data.to_csv("/content/drive/MyDrive/Final_capstone/train_data/total_data.csv",index=False)
total_data=pd.read_csv("/content/drive/MyDrive/Final_capstone/train_data/total_data.csv")
total_data.head()
| date | farm_id | ingredient_type | yield | deidentified_location | temp_obs | wind_direction | dew_temp | pressure_sea_level | precipitation | wind_speed | farm_area | farming_company | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-01-01 00:00:00 | fid_110884 | ing_w | 0.0 | location 7369 | 25.0 | 0.0 | 20.0 | 1019.7 | 0.0 | 0.0 | 690.455096 | Obery Farms |
| 1 | 2016-01-01 00:00:00 | fid_90053 | ing_w | 0.0 | location 7369 | 25.0 | 0.0 | 20.0 | 1019.7 | 0.0 | 0.0 | 252.696160 | Obery Farms |
| 2 | 2016-01-01 00:00:00 | fid_17537 | ing_w | 0.0 | location 7369 | 25.0 | 0.0 | 20.0 | 1019.7 | 0.0 | 0.0 | 499.446528 | Obery Farms |
| 3 | 2016-01-01 00:00:00 | fid_110392 | ing_w | 0.0 | location 7369 | 25.0 | 0.0 | 20.0 | 1019.7 | 0.0 | 0.0 | 2200.407555 | Obery Farms |
| 4 | 2016-01-01 00:00:00 | fid_62402 | ing_w | 0.0 | location 7369 | 25.0 | 0.0 | 20.0 | 1019.7 | 0.0 | 0.0 | 10833.140121 | Obery Farms |
total_data.shape
(20511532, 13)
total_data.isnull().sum()
date 0 farm_id 0 ingredient_type 0 yield 0 deidentified_location 0 temp_obs 0 wind_direction 0 dew_temp 0 pressure_sea_level 0 precipitation 0 wind_speed 0 farm_area 0 farming_company 0 dtype: int64
ingz=total_data[total_data['ingredient_type']=='ing_z'].copy()  #.copy() so later column assignments don't raise SettingWithCopyWarning
ingz.shape
(1287747, 13)
ingz.head()
| date | farm_id | ingredient_type | yield | deidentified_location | temp_obs | wind_direction | dew_temp | pressure_sea_level | precipitation | wind_speed | farm_area | farming_company | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 106 | 2016-01-01 00:00:00 | fid_87942 | ing_z | 0.000 | location 959 | 3.8 | 240.0 | 2.4 | 1020.9 | -1.0 | 3.1 | 499.260722 | Obery Farms |
| 110 | 2016-01-01 00:00:00 | fid_66870 | ing_z | 0.000 | location 959 | 3.8 | 240.0 | 2.4 | 1020.9 | -1.0 | 3.1 | 5295.006485 | Obery Farms |
| 114 | 2016-01-01 00:00:00 | fid_66062 | ing_z | 96.978 | location 959 | 3.8 | 240.0 | 2.4 | 1020.9 | -1.0 | 3.1 | 2992.034018 | Obery Farms |
| 116 | 2016-01-01 00:00:00 | fid_75323 | ing_z | 19.597 | location 959 | 3.8 | 240.0 | 2.4 | 1020.9 | -1.0 | 3.1 | 9334.986343 | Obery Farms |
| 118 | 2016-01-01 00:00:00 | fid_75397 | ing_z | 100.000 | location 959 | 3.8 | 240.0 | 2.4 | 1020.9 | -1.0 | 3.1 | 12976.969749 | Obery Farms |
ingz.isnull().sum()
date 0 farm_id 0 ingredient_type 0 yield 0 deidentified_location 0 temp_obs 0 wind_direction 0 dew_temp 0 pressure_sea_level 0 precipitation 0 wind_speed 0 farm_area 0 farming_company 0 dtype: int64
#ingz.drop(['deidentified_location','farming_company'],axis=1,inplace=True)
ingz.head()
| date | farm_id | ingredient_type | yield | deidentified_location | temp_obs | wind_direction | dew_temp | pressure_sea_level | precipitation | wind_speed | farm_area | farming_company | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 106 | 2016-01-01 00:00:00 | fid_87942 | ing_z | 0.000 | location 959 | 3.8 | 240.0 | 2.4 | 1020.9 | -1.0 | 3.1 | 499.260722 | Obery Farms |
| 110 | 2016-01-01 00:00:00 | fid_66870 | ing_z | 0.000 | location 959 | 3.8 | 240.0 | 2.4 | 1020.9 | -1.0 | 3.1 | 5295.006485 | Obery Farms |
| 114 | 2016-01-01 00:00:00 | fid_66062 | ing_z | 96.978 | location 959 | 3.8 | 240.0 | 2.4 | 1020.9 | -1.0 | 3.1 | 2992.034018 | Obery Farms |
| 116 | 2016-01-01 00:00:00 | fid_75323 | ing_z | 19.597 | location 959 | 3.8 | 240.0 | 2.4 | 1020.9 | -1.0 | 3.1 | 9334.986343 | Obery Farms |
| 118 | 2016-01-01 00:00:00 | fid_75397 | ing_z | 100.000 | location 959 | 3.8 | 240.0 | 2.4 | 1020.9 | -1.0 | 3.1 | 12976.969749 | Obery Farms |
ingz_farmid=ingz['farm_id'].value_counts().to_frame().reset_index()['index'].to_list()
ingz[ingz['farm_id']==ingz_farmid[0]]
| date | farm_id | ingredient_type | yield | deidentified_location | temp_obs | wind_direction | dew_temp | pressure_sea_level | precipitation | wind_speed | farm_area | farming_company | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 455 | 2016-01-01 00:00:00 | fid_73431 | ing_z | 0.0 | location 5489 | 15.6 | 270.0 | -5.6 | 1015.3 | -1.0 | 3.6 | 3132.038839 | Wayne Farms |
| 456 | 2016-01-01 00:00:00 | fid_73431 | ing_z | 0.0 | location 2532 | -8.3 | 250.0 | -12.2 | 1023.3 | -1.0 | 4.1 | 5630.479218 | Obery Farms |
| 1667 | 2016-01-01 01:00:00 | fid_73431 | ing_z | 0.0 | location 5489 | 13.9 | 270.0 | -5.6 | 1015.6 | 0.0 | 4.1 | 3132.038839 | Wayne Farms |
| 1668 | 2016-01-01 01:00:00 | fid_73431 | ing_z | 0.0 | location 2532 | -8.3 | 240.0 | -11.7 | 1023.3 | -1.0 | 4.1 | 5630.479218 | Obery Farms |
| 4005 | 2016-01-01 02:00:00 | fid_73431 | ing_z | 0.0 | location 5489 | 13.3 | 270.0 | -5.6 | 1016.0 | 0.0 | 3.1 | 3132.038839 | Wayne Farms |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20504784 | 2016-12-31 21:00:00 | fid_73431 | ing_z | 0.0 | location 2532 | -4.4 | 240.0 | -11.7 | 1009.4 | 0.0 | 4.1 | 5630.479218 | Obery Farms |
| 20507193 | 2016-12-31 22:00:00 | fid_73431 | ing_z | 0.0 | location 5489 | 16.7 | 250.0 | 10.0 | 1011.0 | 0.0 | 3.6 | 3132.038839 | Wayne Farms |
| 20507194 | 2016-12-31 22:00:00 | fid_73431 | ing_z | 0.0 | location 2532 | -5.0 | 230.0 | -11.7 | 1010.0 | 0.0 | 4.6 | 5630.479218 | Obery Farms |
| 20509604 | 2016-12-31 23:00:00 | fid_73431 | ing_z | 0.0 | location 5489 | 16.7 | 250.0 | 10.0 | 1011.0 | 0.0 | 2.6 | 3132.038839 | Wayne Farms |
| 20509605 | 2016-12-31 23:00:00 | fid_73431 | ing_z | 0.0 | location 2532 | -5.6 | 240.0 | -11.1 | 1009.8 | 0.0 | 3.6 | 5630.479218 | Obery Farms |
17564 rows × 13 columns
# To find partnership farms
repe
['fid_29387', 'fid_68792', 'fid_73431', 'fid_40459', 'fid_59158', 'fid_81333', 'fid_53126', 'fid_26064', 'fid_97094', 'fid_63700', 'fid_71910', 'fid_18990', 'fid_122174', 'fid_54932', 'fid_68761']
ingz_finalfid=[]
repe_ingz=[]
for i in ingz_farmid:
    if i in repe:
        repe_ingz.append(i)
    else:
        ingz_finalfid.append(i)
repe_ingz
['fid_73431', 'fid_68761', 'fid_63700']
len(ingz_finalfid)
142
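The append loop above is a set split; the same result falls out of two list comprehensions, with a `set` for O(1) membership tests. Toy inputs mirror `ingz_farmid` and `repe`:

```python
# toy inputs standing in for ingz_farmid and repe above
ingz_farmid = ["fid_73431", "fid_68761", "fid_63700", "fid_87942", "fid_66870"]
repe = ["fid_29387", "fid_73431", "fid_68761", "fid_63700"]

repe_set = set(repe)  # set membership is O(1) instead of scanning the list
repe_ingz = [i for i in ingz_farmid if i in repe_set]
ingz_finalfid = [i for i in ingz_farmid if i not in repe_set]
```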
ingz
| date | farm_id | ingredient_type | yield | deidentified_location | temp_obs | wind_direction | dew_temp | pressure_sea_level | precipitation | wind_speed | farm_area | farming_company | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 106 | 2016-01-01 00:00:00 | fid_87942 | ing_z | 0.000 | location 959 | 3.8 | 240.0 | 2.4 | 1020.9 | -1.0 | 3.1 | 499.260722 | Obery Farms |
| 110 | 2016-01-01 00:00:00 | fid_66870 | ing_z | 0.000 | location 959 | 3.8 | 240.0 | 2.4 | 1020.9 | -1.0 | 3.1 | 5295.006485 | Obery Farms |
| 114 | 2016-01-01 00:00:00 | fid_66062 | ing_z | 96.978 | location 959 | 3.8 | 240.0 | 2.4 | 1020.9 | -1.0 | 3.1 | 2992.034018 | Obery Farms |
| 116 | 2016-01-01 00:00:00 | fid_75323 | ing_z | 19.597 | location 959 | 3.8 | 240.0 | 2.4 | 1020.9 | -1.0 | 3.1 | 9334.986343 | Obery Farms |
| 118 | 2016-01-01 00:00:00 | fid_75397 | ing_z | 100.000 | location 959 | 3.8 | 240.0 | 2.4 | 1020.9 | -1.0 | 3.1 | 12976.969749 | Obery Farms |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20511280 | 2016-12-31 23:00:00 | fid_113277 | ing_z | 2476.610 | location 5410 | 6.1 | 190.0 | -6.7 | 1012.4 | -1.0 | 9.8 | 15467.327567 | Sanderson Farms |
| 20511283 | 2016-12-31 23:00:00 | fid_46846 | ing_z | 1909.330 | location 5410 | 6.1 | 190.0 | -6.7 | 1012.4 | -1.0 | 9.8 | 8101.141600 | Wayne Farms |
| 20511286 | 2016-12-31 23:00:00 | fid_26870 | ing_z | 364.019 | location 5410 | 6.1 | 190.0 | -6.7 | 1012.4 | -1.0 | 9.8 | 7867.769264 | Sanderson Farms |
| 20511288 | 2016-12-31 23:00:00 | fid_75152 | ing_z | 156.050 | location 5150 | 1.7 | 180.0 | -5.6 | 1008.5 | -1.0 | 8.8 | 14416.408831 | Sanderson Farms |
| 20511303 | 2016-12-31 23:00:00 | fid_46342 | ing_z | 8501.020 | location 5150 | 1.7 | 180.0 | -5.6 | 1008.5 | -1.0 | 8.8 | 15887.156224 | Obery Farms |
1287747 rows × 13 columns
ingz.columns
Index(['date', 'farm_id', 'ingredient_type', 'yield', 'deidentified_location',
'temp_obs', 'wind_direction', 'dew_temp', 'pressure_sea_level',
'precipitation', 'wind_speed', 'farm_area', 'farming_company'],
dtype='object')
len(ingz_finalfid)
142
# Preprocessing the data for statsforecast
ingz.isnull().sum()
date 0 farm_id 0 ingredient_type 0 yield 0 deidentified_location 0 temp_obs 0 wind_direction 0 dew_temp 0 pressure_sea_level 0 precipitation 0 wind_speed 0 farm_area 0 farming_company 0 dtype: int64
ingz['date']=pd.to_datetime(ingz['date'])
cols=['farm_id', 'ingredient_type', 'yield', 'deidentified_location','temp_obs', 'wind_direction', 'dew_temp', 'pressure_sea_level','precipitation', 'wind_speed', 'farm_area', 'farming_company']
df=pd.DataFrame(columns=cols)
new_date=pd.date_range(start='2016-01-01 00:00:00',end='2016-12-31 23:00:00',freq='H')
for i in ingz_finalfid:
    a1=ingz[ingz['farm_id']==i]
    a1.set_index('date',inplace=True)
    a2=a1.reindex(new_date,fill_value=np.nan)
    a2['farm_id']=a2['farm_id'].fillna(i)
    fc=a2['farming_company'].value_counts().reset_index()['index'].to_list()
    a2['farming_company']=a2['farming_company'].fillna(fc[0])
    a2['yield']=a2['yield'].fillna(0)
    loc=a2['deidentified_location'].value_counts().reset_index()['index'].to_list()
    a2['deidentified_location']=a2['deidentified_location'].fillna(loc[0])
    a2['ingredient_type']=a2['ingredient_type'].fillna('ing_z')
    farm_area=a2['farm_area'].value_counts().reset_index()['index'].to_list()[0]
    a2['farm_area']=a2['farm_area'].fillna(farm_area)
    a2.interpolate(method='time',axis=0,inplace=True)
    df=pd.concat([df,a2],axis=0)
df.interpolate(method='time',axis=0,inplace=True)
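The per-farm reindex-and-fill step boils down to this pattern: reindex to the full hourly range (missing hours become NaN rows), restore the constant id columns, then time-interpolate the numeric gaps. A toy two-observation series with a missing 01:00 hour:

```python
import pandas as pd

# one farm's yield series with the 01:00 observation missing (toy data)
a1 = pd.DataFrame(
    {"yield": [1.0, 3.0], "farm_id": ["fid_toy", "fid_toy"]},
    index=pd.to_datetime(["2016-01-01 00:00:00", "2016-01-01 02:00:00"]),
)
full_range = pd.date_range("2016-01-01 00:00:00", "2016-01-01 03:00:00", freq="H")

a2 = a1.reindex(full_range)                      # missing hours become NaN rows
a2["farm_id"] = a2["farm_id"].fillna("fid_toy")  # restore the constant id
a2["yield"] = a2["yield"].interpolate(method="time")  # fill gaps by elapsed time
```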
df.isnull().sum()
farm_id 0 ingredient_type 0 yield 0 deidentified_location 0 temp_obs 0 wind_direction 0 dew_temp 0 pressure_sea_level 0 precipitation 0 wind_speed 0 farm_area 0 farming_company 0 dtype: int64
df.head()
| farm_id | ingredient_type | yield | deidentified_location | temp_obs | wind_direction | dew_temp | pressure_sea_level | precipitation | wind_speed | farm_area | farming_company | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2016-01-01 00:00:00 | fid_34542 | ing_z | 20.5150 | location 5489 | 15.6 | 270.0 | -5.6 | 1015.3 | -1.0 | 3.6 | 10371.226405 | Obery Farms |
| 2016-01-01 01:00:00 | fid_34542 | ing_z | 17.5843 | location 5489 | 13.9 | 270.0 | -5.6 | 1015.6 | 0.0 | 4.1 | 10371.226405 | Obery Farms |
| 2016-01-01 02:00:00 | fid_34542 | ing_z | 20.5150 | location 5489 | 13.3 | 270.0 | -5.6 | 1016.0 | 0.0 | 3.1 | 10371.226405 | Obery Farms |
| 2016-01-01 03:00:00 | fid_34542 | ing_z | 20.5150 | location 5489 | 12.2 | 280.0 | -6.1 | 1016.6 | 0.0 | 3.1 | 10371.226405 | Obery Farms |
| 2016-01-01 04:00:00 | fid_34542 | ing_z | 20.5150 | location 5489 | 11.7 | 270.0 | -6.7 | 1017.0 | 0.0 | 3.1 | 10371.226405 | Obery Farms |
df[df['farm_id']=='fid_27285']
| farm_id | ingredient_type | yield | deidentified_location | temp_obs | wind_direction | dew_temp | pressure_sea_level | precipitation | wind_speed | farm_area | farming_company | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2016-01-01 00:00:00 | fid_27285 | ing_z | 0.000 | location 1784 | 15.6 | 270.0 | -5.6 | 1015.3 | -1.0 | 3.6 | 7835.996438 | Obery Farms |
| 2016-01-01 01:00:00 | fid_27285 | ing_z | 1303.400 | location 1784 | -10.6 | 0.0 | -13.9 | 1036.7 | 0.0 | 0.0 | 7835.996438 | Obery Farms |
| 2016-01-01 02:00:00 | fid_27285 | ing_z | 1531.180 | location 1784 | -11.1 | 0.0 | -13.9 | 1036.9 | 0.0 | 0.0 | 7835.996438 | Obery Farms |
| 2016-01-01 03:00:00 | fid_27285 | ing_z | 1436.850 | location 1784 | -10.0 | 280.0 | -15.0 | 1036.9 | 0.0 | 2.6 | 7835.996438 | Obery Farms |
| 2016-01-01 04:00:00 | fid_27285 | ing_z | 1479.420 | location 1784 | -13.3 | 230.0 | -15.6 | 1037.8 | 0.0 | 1.5 | 7835.996438 | Obery Farms |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2016-12-31 19:00:00 | fid_27285 | ing_z | 100.296 | location 1784 | -4.4 | 60.0 | -7.2 | 1019.7 | 0.0 | 1.5 | 7835.996438 | Obery Farms |
| 2016-12-31 20:00:00 | fid_27285 | ing_z | 89.550 | location 1784 | -4.4 | 170.0 | -7.8 | 1018.8 | 0.0 | 2.1 | 7835.996438 | Obery Farms |
| 2016-12-31 21:00:00 | fid_27285 | ing_z | 88.356 | location 1784 | -4.4 | 0.0 | -7.2 | 1017.9 | 0.0 | 0.0 | 7835.996438 | Obery Farms |
| 2016-12-31 22:00:00 | fid_27285 | ing_z | 96.714 | location 1784 | -4.4 | 120.0 | -7.2 | 1017.8 | 0.0 | 2.1 | 7835.996438 | Obery Farms |
| 2016-12-31 23:00:00 | fid_27285 | ing_z | 91.938 | location 1784 | -5.0 | 310.0 | -7.8 | 1017.4 | 0.0 | 1.5 | 7835.996438 | Obery Farms |
8784 rows × 12 columns
df.shape
(1247328, 12)
df.reset_index(inplace=True)
df.head()
| index | farm_id | ingredient_type | yield | deidentified_location | temp_obs | wind_direction | dew_temp | pressure_sea_level | precipitation | wind_speed | farm_area | farming_company | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-01-01 00:00:00 | fid_34542 | ing_z | 20.5150 | location 5489 | 15.6 | 270.0 | -5.6 | 1015.3 | -1.0 | 3.6 | 10371.226405 | Obery Farms |
| 1 | 2016-01-01 01:00:00 | fid_34542 | ing_z | 17.5843 | location 5489 | 13.9 | 270.0 | -5.6 | 1015.6 | 0.0 | 4.1 | 10371.226405 | Obery Farms |
| 2 | 2016-01-01 02:00:00 | fid_34542 | ing_z | 20.5150 | location 5489 | 13.3 | 270.0 | -5.6 | 1016.0 | 0.0 | 3.1 | 10371.226405 | Obery Farms |
| 3 | 2016-01-01 03:00:00 | fid_34542 | ing_z | 20.5150 | location 5489 | 12.2 | 280.0 | -6.1 | 1016.6 | 0.0 | 3.1 | 10371.226405 | Obery Farms |
| 4 | 2016-01-01 04:00:00 | fid_34542 | ing_z | 20.5150 | location 5489 | 11.7 | 270.0 | -6.7 | 1017.0 | 0.0 | 3.1 | 10371.226405 | Obery Farms |
df.rename(columns={'index':'ds','farm_id':'unique_id','yield':'y'},inplace=True)
df.head()
| ds | unique_id | ingredient_type | y | deidentified_location | temp_obs | wind_direction | dew_temp | pressure_sea_level | precipitation | wind_speed | farm_area | farming_company | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-01-01 00:00:00 | fid_34542 | ing_z | 20.5150 | location 5489 | 15.6 | 270.0 | -5.6 | 1015.3 | -1.0 | 3.6 | 10371.226405 | Obery Farms |
| 1 | 2016-01-01 01:00:00 | fid_34542 | ing_z | 17.5843 | location 5489 | 13.9 | 270.0 | -5.6 | 1015.6 | 0.0 | 4.1 | 10371.226405 | Obery Farms |
| 2 | 2016-01-01 02:00:00 | fid_34542 | ing_z | 20.5150 | location 5489 | 13.3 | 270.0 | -5.6 | 1016.0 | 0.0 | 3.1 | 10371.226405 | Obery Farms |
| 3 | 2016-01-01 03:00:00 | fid_34542 | ing_z | 20.5150 | location 5489 | 12.2 | 280.0 | -6.1 | 1016.6 | 0.0 | 3.1 | 10371.226405 | Obery Farms |
| 4 | 2016-01-01 04:00:00 | fid_34542 | ing_z | 20.5150 | location 5489 | 11.7 | 270.0 | -6.7 | 1017.0 | 0.0 | 3.1 | 10371.226405 | Obery Farms |
y=df[['y','unique_id','ds']]
StatsForecast.plot(y)
df.columns
Index(['ds', 'unique_id', 'ingredient_type', 'y', 'deidentified_location',
'temp_obs', 'wind_direction', 'dew_temp', 'pressure_sea_level',
'precipitation', 'wind_speed', 'farm_area', 'farming_company'],
dtype='object')
x=df[['ds', 'unique_id','temp_obs', 'wind_direction', 'dew_temp', 'pressure_sea_level','precipitation', 'wind_speed', 'farm_area']]
StatsForecast.plot(x)
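Pulling the schema steps together: StatsForecast expects long format, one row per series per timestamp, with `unique_id` keying the series, `ds` the timestamp, and `y` the target. The `reset_index`/`rename` above reduces to this sketch (toy values):

```python
import pandas as pd

# toy date-indexed frame mirroring df before reset_index()
raw = pd.DataFrame(
    {"farm_id": ["fid_34542"] * 3, "yield": [20.5150, 17.5843, 20.5150]},
    index=pd.date_range("2016-01-01 00:00:00", periods=3, freq="H"),
)

# the unnamed DatetimeIndex becomes a column called 'index' after reset_index,
# which is then renamed to the unique_id/ds/y contract StatsForecast expects
long_df = (
    raw.reset_index()
       .rename(columns={"index": "ds", "farm_id": "unique_id", "yield": "y"})
)
```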